Recent years have seen explosive growth
in biological data, which are often no longer published
in the conventional sense but deposited in a database.
Sequence data from mega-sequencing projects may not
even be linked to a conventional publication. This trend,
and the need for computational analysis of the data,
have made databases essential tools for biological research.
The goal of this material is to describe the different
molecular biology databases available to researchers.
There are so many specialised databases that it is
not reasonable to list the URLs of all of them, especially
since this category of databases changes quickly
and any list provided here would soon be outdated.
However, at the URL http://www.expasy.ch/alinks.html you
will find a WWW document that lists information sources
for molecular biologists and is kept constantly up to date.
Services that abstract the scientific literature
began to make their data available in machine-readable
form in the early 1960s. You should be aware that none
of the abstracting services has complete coverage.
The best known is "MEDLINE", now "PUBMED",
which abstracts mainly the medical literature. It
is best accessed through NCBI's ENTREZ (http://www.ncbi.nlm.nih.gov/Entrez/).
EMBASE is a commercial product for the medical literature.
BIOSIS, the inheritor of the old Biological Abstracts, covers
a broad biological field; the Zoological Record indexes
the zoological literature.
CAB International (http://www.cabi.org/)
maintains abstract databases in the fields of agriculture
and parasitic diseases. AGRICOLA is for the agricultural
field what MEDLINE is for the medical field (http://www.nalusda.gov/general_info/agricola/agricola.html).
The bibliographical databases are with the exception
of MEDLINE/PUBMED only available through commercial
Taxonomic databases are rather controversial
since the soundness of the taxonomic classifications
done by one taxonomist will be directly questioned by
Various efforts are going on to create a taxonomy resource
(e.g. "The Tree of Life" project (http://phylogeny.arizona.edu/tree/life.html),
"Species 2000" (http://www.sp2000.org/),
International Organization for Plant Information (http://iopi.csu.edu.au/iopi/),
Integrated Taxonomic Information System (http://www.itis.usda.gov/itis/),
etc.). The most generally useful taxonomic database
is that maintained by the NCBI (http://www.ncbi.nlm.nih.gov/Taxonomy/).
This hierarchical taxonomy is used by the Nucleotide
Sequence Databases, SWISS-PROT and TrEMBL, and is curated
by an informal group of experts.
The International Nucleotide Sequence Database
Collaboration (often, though inaccurately, referred
to as "GenBank") jointly produces
the nucleotide sequence database. Its partners are the DDBJ (DNA Data
Bank of Japan, http://www.ddbj.nig.ac.jp/),
the EBI (European Bioinformatics Institute, http://www.ebi.ac.uk/),
and the NCBI (National Center for Biotechnology Information).
In Europe, the vast majority of the nucleotide sequence
data produced is collected, organised and distributed
by the EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl.html)
located at the European Bioinformatics Institute (Cambridge,
UK), an Outstation of the European Molecular Biology
Laboratory (EMBL) in Heidelberg, Germany. The nucleotide
sequence databases are data repositories, accepting
nucleic acid sequence data from the community and making
it freely available. The databases strive for completeness,
with the aim of recording every publicly known nucleic
acid sequence. These data are heterogeneous: they vary
with respect to the source of the material (e.g. genomic
versus cDNA), the intended quality (e.g. finished versus
single pass sequences), the extent of sequence annotation
and the intended completeness of the sequence relative
to its biological target (e.g. complete versus partial
coverage of a gene or a genome). EMBL, NCBI and DDBJ
automatically update each other every 24 hours with
the new sequences they collected or updated. The result
is that they contain exactly the same information, except
for sequences that have been added in the last 24 hours.
Each entry in a database must have a unique identifier
that is a string of letters and/or numbers that only
that record has. This unique identifier, which is known
as the accession number, can be quoted in the scientific
literature, as it will never change. As the accession
number must always remain the same, another code is
used to indicate the different versions due to sequence
corrections. You should therefore always take care to
quote both the unique identifier and the version number,
when referring to records in a nucleotide sequence database.
The AC (ACcession number) line in a nucleotide sequence
record lists the accession numbers associated with this
entry. The accession number consists of one letter followed
by five digits (X12345), or (more recently) two letters
followed by six digits (XY123456).
An example of an accession number line is shown below:
AC Y00321; J05348;
An accession number is dropped from the database only
when the data to which it was assigned have been completely
removed from the database.
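As a minimal sketch of how such a line could be read programmatically, the following Python function splits an AC line into its accession numbers and checks them against the two formats described above. The function name and the regular expression are illustrative assumptions, not part of any database software, and other accession formats are not covered.

```python
import re

# The two accession formats described in the text: one letter followed by
# five digits, or two letters followed by six digits. Any other format is
# rejected by this sketch.
ACCESSION_RE = re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6})$")


def parse_ac_line(line):
    """Return the accession numbers listed on one AC line (hypothetical helper)."""
    code, _, rest = line.partition(" ")
    if code != "AC":
        raise ValueError("not an AC line: %r" % line)
    # Accession numbers are separated (and terminated) by semicolons.
    accessions = [item.strip() for item in rest.split(";") if item.strip()]
    for accession in accessions:
        if not ACCESSION_RE.match(accession):
            raise ValueError("unexpected accession format: %r" % accession)
    return accessions


print(parse_ac_line("AC Y00321; J05348;"))
```

Applied to the example line above, the function returns the two accession numbers in the order in which they appear on the line.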
The SV (Sequence Version) line contains the nucleotide
sequence identifier, which allows you to recognise the
sequence version of this record.
An example of a Sequence Version line is shown below:
The nucleotide sequence identifier has the form
'Accession.Version' (e.g. AJ000012.1). The first part
is the never changing accession number, followed by
a period and a version number. The accession number
part will be stable, but the version part will be incremented
when the sequence changes.
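The 'Accession.Version' form lends itself to a simple split at the last period. The sketch below is only an illustration under that assumption; the names SequenceId, parse_sequence_id and same_record are hypothetical, and the identifier AJ000012.2 is an invented later version used purely to show the comparison.

```python
from typing import NamedTuple


class SequenceId(NamedTuple):
    accession: str  # the stable part, which never changes
    version: int    # incremented when the sequence changes


def parse_sequence_id(identifier):
    """Split an 'Accession.Version' string into its two parts."""
    accession, sep, version = identifier.strip().rpartition(".")
    if not sep or not accession or not version.isdigit():
        raise ValueError("not of the form Accession.Version: %r" % identifier)
    return SequenceId(accession, int(version))


def same_record(first, second):
    """True if two identifiers refer to the same record, whatever the version."""
    return parse_sequence_id(first).accession == parse_sequence_id(second).accession


print(parse_sequence_id("AJ000012.1"))
# AJ000012.2 is a hypothetical later version of the same record.
print(same_record("AJ000012.1", "AJ000012.2"))
```

Comparing only the accession part tells you whether two citations point to the same record; comparing the version part tells you whether they point to the same sequence.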
Although the nucleotide sequence data are checked for
integrity and obvious errors by the data library staff,
the quality of the data is the responsibility of the
submitter. As a consequence, there are many errors in
the database: many sequence entries are either mislabelled,
contaminated, incompletely or erroneously annotated,
or contain sequencing errors. In addition, the database
is very redundant, in the sense that the same sequence
from the same organism may be included many times, simply
reflecting the redundancy of the original scientific
Sequence-cluster databases such as UniGene (http://www.ncbi.nlm.nih.gov/UniGene/)
and STACK (Sequence Tag Alignment and Consensus Knowledgebase)
address the redundancy problem by coalescing sequences
that are sufficiently similar that one may reasonably
infer that they are derived from the same gene.
Several specialised sequence databases are also available.
Some of these deal with particular classes of sequence, such as
the Ribosomal Database Project (RDP, http://rdp.life.uiuc.edu/index.html),
the HIV Sequence Database (http://hiv-web.lanl.gov/), and
IMGT, the ImMunoGeneTics database (http://imgt.cnusc.fr:8104/textes/info.html);
others focus on particular features, such as
TRANSFAC for transcription factors and transcription
factor binding sites (http://transfac.gbf-braunschweig.de/TRANSFAC/index.html),
EPD (Eukaryotic Promoter Database, ftp://ftp.ebi.ac.uk/pub/databases/epd/)
for promoters, and REBASE (http://rebase.neb.com/rebase/)
for restriction enzymes and restriction enzyme sites.
is a specialised database of organelle genomes. A database
for mitochondrial genomics is mitBASE (http://www3.ebi.ac.uk/Research/Mitbase/mitbase.pl)
from the EBI.
For organisms of major interest to geneticists,
there is a long history of conventionally published
catalogues of genes or mutations. In the past few years,
most of these have been made available in an electronic
form and a variety of new databases have been developed.
These databases differ greatly in form and content,
both in the classes of data captured and in how these
data are stored.
There are several databases for Escherichia coli.
CGSC, the E. coli Genetic Stock Center (http://cgsc.biology.yale.edu/top.html),
maintains a database of E. coli genetic information,
including genotypes and reference information for the
strains in the CGSC collection, gene names, properties,
and linkage map, gene product information, and information
on specific mutations. The E. coli Database collection
in Giessen, Germany, maintains curated gene-based sequence
records for E. coli. EcoCyc (http://ecocyc.PangeaSystems.com/ecocyc/ecocyc.html),
the "Encyclopedia of E. coli Genes and Metabolism"
is a database of E. coli genes and metabolic
The MIPS yeast database (http://www.mips.biochem.mpg.de/proj/yeast/)
is an important resource for information on the yeast
genome and its products.
The Saccharomyces Genome Database (http://genome-www.stanford.edu/Saccharomyces/)
is another major yeast database.
MaizeDB is the database for genetic data on maize.
For other plants, AGIS (Agricultural Genome Information System, http://probe.nalusda.gov:8000/index.html)
provides access to many different genome
databases (mostly in ACEDB format), including Chlamydomonas,
cotton, alfalfa, wheat, barley, rye, rice, millet,
sorghum and species of Solanaceae and trees. MENDEL
is a plant-wide database for plant genes (http://jiio6.bbsrc.ac.uk/).
ACeDB is the database for genetic and molecular data
concerning Caenorhabditis elegans. The database
management system written for ACeDB by R Durbin and
J Thierry-Mieg has proved very popular and has been
used in many other species-specific databases. ACEDB
(spelled with a capital ‘E') is now the name of this
database management system, resulting in some confusion
relative to the C. elegans database. The entire
database can be downloaded from the Sanger Institute (http://www.sanger.ac.uk/Projects/C_elegans/).
Two of the best-curated genetic databases are FlyBase,
the database for Drosophila melanogaster, and
the Mouse Genome Database (MGD, http://www.informatics.jax.org/).
ZFIN, a database for another important model organism,
the zebrafish Brachydanio rerio, has been implemented
There are also genetic databases available for several
animals of economic importance to humans. These include
pig (PIGBASE), cows (BovGBASE), sheep (SheepBASE) and
chicken (ChickBASE). In addition, there is a database
of mutant phenotypes modeled on Mendelian Inheritance
in Man, Mendelian Inheritance in Animals.
All these databases are available via the AGIS server
and most from the Roslin Institute server (http://www.ri.bbsrc.ac.uk/bioinformatics/databases.html)
and from the Japanese Animal Genome Database (http://ws4.niai.affrc.go.jp/).
Two major databases for human genes and genomics are
in existence. V McKusick's Mendelian Inheritance in
Man (MIM) is a catalogue of human genes and genetic
disorders and is available in an online form (OMIM)
from the NCBI. The Genome Database (GDB, http://www.gdb.org/)
is the major human genome database including both molecular
and mapping data.
Both OMIM and GDB include information on genetic variation
in humans but there is also the human mutation server
at the EBI (http://www.ebi.ac.uk/mutations/index.html),
with links to the many single sequence variation databases
at the EBI; and to the SRS (Sequence Retrieval System)
interface to many human mutation databases.
The GeneCards resource at the Weizmann Institute (http://bioinfo.weizmann.ac.il/cards/)
integrates information about human genes from a variety
of databases, including GDB, OMIM, SWISS-PROT and the
nucleotide sequence databases.
also provides a database of human genes, with links
to diseases and maps.
A parasite genome database (http://www.ebi.ac.uk/parasites/parasite-genome.html)
is supported by the World Health Organisation (WHO)
at the EBI, covering the five ‘targets' of its Tropical
Diseases Research programme: Leishmania, Trypanosoma
cruzi, African Trypanosomes, Schistosoma
and Filariasis. Databases for some vectors of parasitic
diseases are also available, such as AnoDB (http://konops.imbb.forth.gr/AnoDB/)
for Anopheles and AaeDB (http://klab.agsci.colostate.edu/)
for Aedes aegypti.
The protein sequence databases are the most
comprehensive source of information on proteins. It
is necessary to distinguish between universal databases
covering proteins from all species and specialised data
collections storing information about specific families
or groups of proteins, or about the proteins of a specific
organism. Two categories of universal protein sequence
databases can be discerned: simple archives of sequence
data; and annotated databases where additional information
has been added to the sequence record. The following
sections give a short description of the Protein Information
Resource (PIR), the oldest protein sequence database;
a more detailed description of SWISS-PROT, an annotated
universal sequence database; and a description of TrEMBL, the supplement
of SWISS-PROT, which can be classified as a computer-annotated
sequence repository. These are followed by a discussion
of the issues of completeness and redundancy, and finally
by some examples of specialised protein sequence collections.
The Protein Information Resource
PIR (Barker et al., 1999) was established
in 1984 by the National Biomedical Research Foundation
(NBRF) as a successor of the original NBRF Protein Sequence
Database, developed over a 20 year period by the late
Margaret O. Dayhoff and published as the `Atlas of Protein
Sequence and Structure' (Dayhoff et al., 1965;
Dayhoff, 1979). Since 1988 the database has been maintained
by PIR-International, a collaboration between the NBRF,
the Munich Information Center for Protein Sequences
(MIPS), and the Japan International Protein Information
The PIR release 60.10 (June 15, 1999) contained 131,026
entries. The database is partitioned into four sections,
PIR1 (14,753 entries), PIR2 (115,383 entries), PIR3
(560 entries) and PIR4 (330 entries). Entries in PIR1
are fully classified by superfamily assignment, fully
annotated and fully merged with respect to other entries
in PIR1. The annotation content as well as the level
of redundancy reduction varies in PIR2 entries. Many
entries in PIR2 are merged, classified, and annotated.
Entries in PIR3 are not classified, merged or annotated.
PIR3 serves as a temporary buffer for new entries. PIR4
was created to include sequences identified as not naturally
occurring or expressed, such as known pseudogenes, unexpressed
ORFs, synthetic sequences, and non-naturally occurring
fusion, crossover or frameshift mutations.
PIR also provides some degree of cross-referencing
to other biomolecular databases by linking to the DDBJ/EMBL/GenBank
nucleotide sequence databases, PDB, GDB, FlyBase, OMIM,
SGD, and MGD.
Introduction. SWISS-PROT (Bairoch and Apweiler,
1999) is an annotated protein sequence database established
in 1986 and maintained collaboratively by the Swiss
Institute of Bioinformatics and the EMBL Outstation
- The European Bioinformatics Institute (EBI). It strives
to provide a high level of annotation, a minimal level
of redundancy, a high level of integration with other
biomolecular databases as well as extensive external
documentation. Each entry in SWISS-PROT is thoroughly
analysed and annotated by biologists, ensuring a high
standard of annotation and maintaining the quality of
the database (Apweiler et al., 1997). SWISS-PROT
contains data that originate from a wide variety of
organisms; release 38 (July 1999) contained around 80,000
annotated sequence entries from more than 6,000 different
species. However, half of the entries come from about 20
organisms, which are the target of many biological studies
(ranked by number of entries): Homo sapiens, Saccharomyces
cerevisiae, Escherichia coli, Mus musculus, Rattus norvegicus,
Bacillus subtilis, Caenorhabditis elegans, Haemophilus
influenzae, Schizosaccharomyces pombe, Methanococcus
jannaschii, Bos taurus, Drosophila melanogaster, Mycobacterium
tuberculosis, Gallus gallus, Arabidopsis thaliana, Salmonella
typhimurium, Xenopus laevis, Synechocystis sp. (strain
PCC 6803), Sus scrofa, and Oryctolagus cuniculus.
A close look at a SWISS-PROT entry. A sample
SWISS-PROT entry is shown in Figure
1. SWISS-PROT entries are made up of different
line types, each beginning with a two-character
line code that indicates the type of data stored in the
line. There are 22 different line types in SWISS-PROT.
Some line types may occur more than once in an entry
and some entries do not contain all line types. Let
us take a close look at the entry in Figure
1 to explain the information found
in the different lines:
DT 01-APR-1993 (Rel. 25, Created)
DT 01-APR-1993 (Rel. 25, Last sequence update)
DT 15-JUL-1999 (Rel. 38, Last annotation update)
DE CD40 LIGAND (CD40-L) (TNF-RELATED ACTIVATION PROTEIN) (TRAP) (T CELL
DE ANTIGEN GP39) (CD154 ANTIGEN).
GN TNFSF5 OR CD40LG OR CD40L OR TRAP.
OS Homo sapiens (Human).
OC Eukaryota; Metazoa; Chordata; Craniata;
OC Eutheria; Primates; Catarrhini; Hominidae;
The identification line (ID) is the first
line in every SWISS-PROT entry. It contains the entry
name, which provides an easy way of labelling an entry.
In our example, TNF5_HUMAN is the entry name for the
human CD40 ligand; while P29965 is its accession number,
shown in the AC (ACcession) line(s). For reasons of consistency
it is sometimes necessary to change entry names from
one release of the database to another. Accession numbers
provide an unambiguous way to refer to sequence entries
and should always be used when you need to cite a particular
entry, since they never change. It sometimes
happens that the AC line contains more than one accession
number. In this case you should always cite the first
one, the so-called "primary accession number".
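Because every line starts with a two-character line code, an entry can be read by grouping its lines by that code. The following Python sketch illustrates the idea and picks out the primary accession number; the function names are hypothetical, and the sample text contains only the entry name on the ID line because the remaining ID fields are not shown in this material.

```python
def group_by_line_code(entry_text):
    """Map each two-character line code to the list of its line contents."""
    groups = {}
    for line in entry_text.splitlines():
        if len(line) < 2 or not line.strip():
            continue
        # The first two characters are the line code; the rest is the data.
        groups.setdefault(line[:2], []).append(line[2:].strip())
    return groups


def primary_accession(groups):
    """Return the first accession number on the AC line(s), or None."""
    joined = " ".join(groups.get("AC", []))
    accessions = [item.strip() for item in joined.split(";") if item.strip()]
    return accessions[0] if accessions else None


# Abbreviated sample built from the lines discussed in the text.
SAMPLE = """\
ID TNF5_HUMAN
AC P29965;
DT 01-APR-1993 (Rel. 25, Created)
DT 01-APR-1993 (Rel. 25, Last sequence update)
GN TNFSF5 OR CD40LG OR CD40L OR TRAP.
OS Homo sapiens (Human).
"""

groups = group_by_line_code(SAMPLE)
print(groups["ID"][0].split()[0])  # entry name
print(primary_accession(groups))   # primary accession number
```

Keeping the lines of each type in a list preserves repeated line types such as DT, which occur more than once in an entry.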
The three DaTe (DT) lines, which follow the AC line,
show you when the entry was created, when the sequence
was last updated and when the annotation was last updated.
The DE (Description) line(s) list all the names under
which a particular protein is or has been known. The
next line, the GN (GeneName) line, lists the designation(s)
of the protein's gene. This line can be absent if no
gene name has been given, or it can be quite extensive,
as some DE lines are, if multiple symbols have been
assigned by different groups. The DE line also gives
an indication of how well the protein has been characterised.
Our example describes the protein as ‘CD40 LIGAND'.
That means that this protein has been experimentally
characterised as the ‘CD40 LIGAND'. With the increasing
amount of data coming from mega-sequencing projects
you will find more and more proteins in SWISS-PROT with
no experimental characterisation. These proteins can
be identified through the standardised labelling of
the DE line.
When a protein exhibits extensive sequence similarity
to a characterised protein and/or has the same conserved
regions then the label ‘probable' is used in
the DE line. It is normally followed by the full name
of a protein from the same family that it matches.
DE PROBABLE 5'-NUCLEOTIDASE PRECURSOR (EC
The label ‘putative' is used in the DE line
of proteins that exhibit limited sequence similarity
to characterised proteins. These proteins often have
a conserved site, e.g. an ATP-binding site, but no other
significant similarity to a characterised protein. This label
is most frequently used for sequences from genome projects.
DE PUTATIVE AMINO-ACID PERMEASE.
The assignment of the labels ‘probable' and
‘putative' is dependent primarily on the results
of sequence similarity searches against SWISS-PROT.
It is important to point out here that no specific cut-off
point is used to assign a protein as ‘putative' or ‘probable',
i.e. it is not the case that <50% identity = putative
and >50% = probable. Let us take Q10480, a predicted
Schizosaccharomyces pombe protein, as an example.
This entry has the following description line:
DE PROBABLE MITOCHONDRIAL NUCLEASE (EC
The FastA results show that the sequence is 47% identical
over its entire length to the mitochondrial nuclease
from Saccharomyces cerevisiae:
101233036 residues in 321608 sequences
statistics extrapolated from 50000 to 321410 sequences
Expectation_n fit: rho(ln(x))= 5.8023+/-0.00053; mu=
mean_var=70.4844+/-13.963, 0's: 144 Z-trim: 31 B-trim: 1593 in 1/64
FASTA (3.2 December, 1998) function [optimised, +1/-3 matrix (15:-5)] ktup:2
join: 37, opt: 25, gap-pen: -12/ -2, width: 16 reg.-scaled
Scan time: 115.367
The best scores are: initn init1 opt z-sc E(321410)
SW:NUC1_YEAST P08466 MITOCHONDRIAL NU ( 329) 941 630 1017 1216.7 1.9e-60
>>SW:NUC1_YEAST P08466 MITOCHONDRIAL NUCLEASE (EC 3.1 (329 aa)
initn: 941 init1: 630 opt: 1017 Z-score: 1216.7 expect()
Smith-Waterman score: 1017; 47.147% identity in 333 aa overlap (1-326:1-325)
Large segments contain identical residues, the E-value
(the assessment of statistical significance based
upon the extreme value distribution) shows that the alignment
is statistically highly significant, and the active site
is conserved, so we tentatively classify the protein as a
‘PROBABLE MITOCHONDRIAL NUCLEASE'.
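If you want to collect such figures from many search reports, the summary line of the alignment can be read with a regular expression. The sketch below assumes the wording of the Smith-Waterman summary shown in the FastA output above; the function name is hypothetical. It only extracts the numbers: as explained earlier, no fixed cut-off decides between ‘putative' and ‘probable', so the classification itself remains a matter of judgement.

```python
import re

# Pattern for the summary line as it appears in the FastA report above.
IDENTITY_RE = re.compile(
    r"Smith-Waterman score:\s*(\d+);\s*([\d.]+)% identity in (\d+)\s+aa overlap"
)


def parse_identity(report):
    """Extract score, percent identity and overlap length from a report."""
    # Collapse all whitespace so that wrapped lines do not break the match.
    match = IDENTITY_RE.search(" ".join(report.split()))
    if match is None:
        return None
    return {
        "score": int(match.group(1)),
        "identity": float(match.group(2)),
        "overlap": int(match.group(3)),
    }


REPORT = """\
Smith-Waterman score: 1017; 47.147% identity in 333
aa overlap (1-326:1-325)
"""
print(parse_identity(REPORT))
```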
All predicted protein sequences lacking any significant
sequence similarity to characterised proteins are labelled
‘hypothetical proteins'. The majority
of these cases come from the genome sequencing projects.
DE HYPOTHETICAL 33.8 KD PROTEIN C5H10.01 IN CHROMOSOME I.
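Since the three labels are standardised and appear at the start of the description, they can be recognised mechanically. The following Python sketch is an illustration only; the function name and the return value "unlabelled" are assumptions made here, and an unlabelled description should not be taken as proof of experimental characterisation without reading the entry.

```python
LABELS = ("PROBABLE", "PUTATIVE", "HYPOTHETICAL")


def characterisation_label(de_line):
    """Return the standardised label that opens a DE line, if there is one."""
    text = de_line.strip()
    # Accept the line with or without its 'DE' line code.
    if text.startswith("DE "):
        text = text[3:].strip()
    words = text.split(None, 1)
    first_word = words[0].upper() if words else ""
    return first_word.lower() if first_word in LABELS else "unlabelled"


for line in (
    "DE PUTATIVE AMINO-ACID PERMEASE.",
    "DE HYPOTHETICAL 33.8 KD PROTEIN C5H10.01 IN CHROMOSOME I.",
    "DE CD40 LIGAND (CD40-L)",
):
    print(characterisation_label(line), "<-", line)
```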
The next lines, the OS (Organism Species) and OC (Organism
Classification) lines, describe the species from which
the protein has been derived. The OS line shows the
scientific name of the organism and, if one exists, its
common English name. The OC lines give the taxonomic
tree. SWISS-PROT, as well as the DDBJ/EMBL/GenBank nucleotide
sequence databases, uses the NCBI taxonomy to standardise
the taxonomies of the molecular sequence databases.
A line not present in our example is the OG (OrGanelle)
line. This line is used to indicate in what organelle
or extrachromosomal element the gene is encoded.
The next part of our sample entry contains various
RP SEQUENCE FROM N.A.
RX MEDLINE; 93076854.
RA GRAF D., KORTHAEUER U., MAGES H.W., SENGER G., KROCZEK R.A.;
RT "Cloning of TRAP, a ligand for CD40 on human T cells.";
RL Eur. J. Immunol. 22:3191-3194(1992).
.. 6 references omitted
RP X-RAY CRYSTALLOGRAPHY (2.0 ANGSTROMS)
RX MEDLINE; 96131874.
RA KARPSUSAS M., HSU Y.-M., WANG J.-H., THOMPSON J., LEDERMAN S.,
RA CHESS L., THOMAS D.;
RT "2-A crystal structure of an extracellular fragment of human CD40
RL Structure 3:1031-1039(1995).
RP 3D-STRUCTURE MODELING OF COMPLEX WITH
RX MEDLINE; 98266353.
RA SINGH J., GARBER E., VAN VLIJMEN H., KARPSUSAS M., HSU Y.-M.,
RA ZHENG Z., NAISMITH J.H., THOMAS D.;
RT "The role of polar interactions in the molecular recognition of CD40L
RT with its receptor CD40.";
RL Protein Sci. 7:1124-1135(1998).
.. 6 references omitted
RP VARIANTS HIGM1 ARG-36; CYS-140; SER-231; MET-254 AND GLY-227 DEL.
RX MEDLINE; 97295077.
RA NONOYAMA S., SHIMADZU M., TORU H., SEYAMA K., NUNOI H., NEUBAUER M.,
RA YATA J.-I., OCH H.D.;
RT "Mutations of the CD40 ligand gene in 13 Japanese patients with
RT X-linked hyper-IgM syndrome.";
RL Hum. Genet. 99:624-627(1997).
Each reference is a block of lines starting
with ‘R': RN, RP, RX, RA, RT and RL. The RN (Reference
Number) line gives simply the number of the reference
in an entry. The RP line provides a short indication
of the work described in the publication. In the RC
(Reference Comment) line you will find information such
as the tissue or strain from which the protein was extracted.
The references shown above have no RC lines, so here is an
example to illustrate the type of information you can
find in RC lines:
RC STRAIN=BALB/C; TISSUE=BRAIN
The RX line - ‘X' for Cross-reference – is
used for the identifier assigned to a specific reference
in a bibliographic database like Medline. The RA (Reference
Author) line mentions the authors of the citation, the
RT (Reference Title) line contains the title and the
RL (Reference Location) line the conventional citation
information of the reference.
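Two of these line types have a regular enough layout to be split into fields. The sketch below reads the KEY=VALUE tokens of an RC line and the journal citation of an RL line as they appear in the examples above. The function names and the regular expression are assumptions made for illustration; RL lines that do not cite a journal article in this layout simply yield None.

```python
import re

# Journal citation as in 'Eur. J. Immunol. 22:3191-3194(1992).'
RL_JOURNAL_RE = re.compile(
    r"^(?P<journal>.+?)\s+(?P<volume>\S+):(?P<first>\d+)-(?P<last>\d+)"
    r"\((?P<year>\d{4})\)\.$"
)


def parse_rc(text):
    """Turn 'STRAIN=BALB/C; TISSUE=BRAIN' into a dictionary of tokens."""
    tokens = {}
    for item in text.split(";"):
        key, sep, value = item.partition("=")
        if sep:
            tokens[key.strip()] = value.strip()
    return tokens


def parse_rl(text):
    """Split a journal citation into its parts, or return None."""
    match = RL_JOURNAL_RE.match(text.strip())
    if match is None:
        return None
    return {
        "journal": match.group("journal"),
        "volume": match.group("volume"),
        "first_page": int(match.group("first")),
        "last_page": int(match.group("last")),
        "year": int(match.group("year")),
    }


print(parse_rc("STRAIN=BALB/C; TISSUE=BRAIN"))
print(parse_rl("Eur. J. Immunol. 22:3191-3194(1992)."))
```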
You can see in our example that, in addition to
citations for sequencing work, SWISS-PROT also includes
references to other scientific work such as 3-D structure
determination, mutagenesis, and detection of post-translational
modifications and variants. It is also important to
know that you will find not only references to published
journal articles, books and theses in SWISS-PROT, but
also to information directly submitted to the database.
Many scientific data are not published anymore in the
conventional sense. It has already been some years since
most journals have declined to publish sequence data
– these are now simply deposited in the sequence databases.
Sequence data from the mega-sequencing projects may
not even be linked to conventional publications. There
is an increasing trend for other classes of data to
be published only in a database. It is important to
be aware of these developments and to realise that biomolecular
databases are becoming much more than a repository of
data that can be found elsewhere.
Continuing in the sample entry we arrive at the following
CC -!- FUNCTION: MEDIATES B-CELL PROLIFERATION IN THE ABSENCE OF CO-
CC     STIMULUS AS WELL AS IGE PRODUCTION IN THE PRESENCE OF IL-4.
CC     INVOLVED IN IMMUNOGLOBULIN
CC -!- SUBUNIT: HOMOTRIMER.
CC -!- SUBCELLULAR LOCATION: TYPE II MEMBRANE PROTEIN. ALSO EXISTS AS AN
CC -!- TISSUE SPECIFICITY: SPECIFICALLY EXPRESSED ON ACTIVATED CD4+
CC -!- DISEASE: DEFECTS IN CD40LG ARE THE CAUSE OF AN X-LINKED
CC     WITH HYPER-IGM (HIGM1), AN IMMUNOGLOBULIN ISOTYPE
CC     SWITCH DEFECT CHARACTERISED BY ELEVATED CONCENTRATIONS OF SERUM
CC     IGM AND DECREASED AMOUNTS OF ALL OTHER ISOTYPES. AFFECTED MALES
CC     PRESENT AT AN EARLY AGE (USUALLY WITHIN THE FIRST YEAR OF LIFE)
CC     RECURRENT BACTERIAL AND OPPORTUNISTIC INFECTIONS, INCLUDING
CC     CARINII PNEUMONIA AND INTRACTABLE DIARRHEA DUE TO
CC     INFECTION. DESPITE SUBSTITUTION TREATMENT WITH
CC     INTRAVENOUS IMMUNOGLOBULIN, THE OVERALL PROGNOSIS IS RATHER POOR,
CC     WITH A DEATH RATE OF ABOUT 10% BEFORE ADOLESCENCE.
CC -!- SIMILARITY: BELONGS TO THE TUMOR NECROSIS FACTOR FAMILY.
CC -!- DATABASE: NAME=CD40Lbase; CD40L defect database (mutation db);
CC -!- DATABASE: NAME=PROW; NOTE=CD guide
The CC (Comments) lines contain various textual
comments grouped under different topics. There are altogether
20 different topics. The current topics and their definitions
are listed in the table below.
Topic                 Description
ALTERNATIVE PRODUCTS  Description of the existence of related protein
                      sequence(s) produced by alternative splicing
                      of the same gene or by the use of alternative
                      initiation codons
CATALYTIC ACTIVITY    Description of the reaction(s) catalyzed by
                      an enzyme
CAUTION               This topic warns you about possible errors
                      and/or grounds for confusion
COFACTOR              Description of an enzyme cofactor
DATABASE              Description of a cross-reference to a network
                      database/resource for a specific protein
DEVELOPMENTAL STAGE   Description of the developmentally specific
                      expression of a protein
DISEASE               Description of the disease(s) associated with
                      a deficiency of a protein
DOMAIN                Description of the domain structure of a protein
ENZYME REGULATION     Description of an enzyme regulatory mechanism
FUNCTION              General description of the function(s) of a
                      protein
INDUCTION             Description of the compound(s) which stimulate
                      the synthesis of a protein
MASS SPECTROMETRY     Reports the exact molecular weight of a protein
                      or part of a protein as determined by mass
                      spectrometric methods
MISCELLANEOUS         Any comment which does not belong to any of
                      the other defined topics
PATHWAY               Description of the metabolic pathway(s) with
                      which a protein is associated
POLYMORPHISM          Description of polymorphism(s)
PTM                   Description of a post-translational modification
SIMILARITY            Description of the similarity(ies) (sequence
                      or structural) of a protein with other proteins
SUBCELLULAR LOCATION  Description of the subcellular location of
                      the mature protein
SUBUNIT               Description of the quaternary structure of
                      a protein
TISSUE SPECIFICITY    Description of the tissue specificity of a
                      protein
Like the DE lines, the CC lines give an indication of
the level of characterisation of a protein. In our example
you can find experimentally verified information about
the ‘FUNCTION', the quaternary structure (‘SUBUNIT'),
the ‘SUBCELLULAR LOCATION' and the ‘TISSUE SPECIFICITY' of
the protein. You also find a description of the ‘DISEASE(s)' known
to be associated with a deficiency of the protein, a
description of the ‘SIMILARITY' of the protein with other
proteins, and a cross-reference to network ‘DATABASE' resource(s)
for this specific protein.
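Each comment starts with the marker ‘-!-' followed by the topic name and a colon, and may continue over several CC lines. A minimal Python sketch of how these blocks could be collected is given below. The function name is hypothetical, and the sketch makes two simplifying assumptions: a line ending in a hyphen is joined to the next one without a space, and CC lines that belong to no topic (such as a copyright block) are not treated specially.

```python
def parse_cc_topics(lines):
    """Collect (topic, text) pairs from CC lines, joining continuation lines."""
    topics = []
    for line in lines:
        body = line[2:].strip()  # drop the 'CC' line code
        if body.startswith("-!-"):
            # A new comment block: the topic name ends at the first colon.
            topic, _, text = body[3:].partition(":")
            topics.append([topic.strip(), text.strip()])
        elif topics and body:
            previous = topics[-1][1]
            # Re-join words hyphenated across a line break (e.g. 'CO-' 'STIMULUS').
            joiner = "" if previous.endswith("-") else " "
            topics[-1][1] = (previous + joiner + body).strip()
    # A list of pairs is used because a topic such as DATABASE may repeat.
    return [(topic, text) for topic, text in topics]


CC_LINES = [
    "CC -!- FUNCTION: MEDIATES B-CELL PROLIFERATION IN THE ABSENCE OF CO-",
    "CC     STIMULUS AS WELL AS IGE PRODUCTION IN THE PRESENCE OF IL-4.",
    "CC -!- SUBUNIT: HOMOTRIMER.",
    "CC -!- SIMILARITY: BELONGS TO THE TUMOR NECROSIS FACTOR FAMILY.",
]
for topic, text in parse_cc_topics(CC_LINES):
    print(topic, "=>", text)
```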
Let us look again at Q10480, the ‘PROBABLE
MITOCHONDRIAL NUCLEASE' of Schizosaccharomyces pombe,
as an example of a protein without biochemical characterisation.
As mentioned before, the sequence is 47%
identical over its entire length to the biochemically
characterised mitochondrial nuclease from Saccharomyces cerevisiae,
and so it was tentatively classified as a mitochondrial
nuclease. In Q10480 you can find the following CC lines:
CC -!- FUNCTION: THIS ENZYME HAS BOTH RNASE AND DNASE ACTIVITY (BY
CC     SIMILARITY).
CC -!- COFACTOR: REQUIRES MANGANESE OR MAGNESIUM (BY SIMILARITY).
CC -!- SUBUNIT: HOMODIMER (BY SIMILARITY).
CC -!- SUBCELLULAR LOCATION: MITOCHONDRIAL INNER MEMBRANE (POTENTIAL).
CC -!- SIMILARITY: BELONGS TO THE DNA/RNA
The function, cofactor and subunit comments
are all labelled ‘by similarity'. This indicates that
they have been assigned on the basis of similarity to an existing
characterised entry, in this case the mitochondrial
nuclease from Saccharomyces cerevisiae. The label ‘potential'
is also used to indicate assignment by comparative
analysis. In general this label is used if there is
no experimental proof for the information given in a
CC topic for a protein, but similarity searches or other
prediction methods suggest that the comment may apply (in the
example of Q10480, the comment about the subcellular location). If
comparative analysis makes a comment highly likely,
then the label ‘probable' is used:
CC -!- SUBUNIT: HOMOTRIMER (PROBABLE).
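The three labels appear in parentheses at the end of a comment, so they can be separated from the comment text. The sketch below is an illustration under that assumption; the function name is hypothetical, and a missing label is reported as None rather than interpreted.

```python
import re

# The three non-experimental labels described in the text.
QUALIFIER_RE = re.compile(r"\((BY SIMILARITY|POTENTIAL|PROBABLE)\)\.?\s*$")


def split_qualifier(comment_text):
    """Return (text, label); label is None when no qualifier is present."""
    match = QUALIFIER_RE.search(comment_text)
    if match is None:
        return comment_text.strip(), None
    return comment_text[: match.start()].strip(), match.group(1)


for text in (
    "HOMODIMER (BY SIMILARITY).",
    "MITOCHONDRIAL INNER MEMBRANE (POTENTIAL)",
    "HOMOTRIMER (PROBABLE).",
    "HOMOTRIMER.",
):
    print(split_qualifier(text))
```

This makes it easy to filter the annotation of an entry down to the comments that carry no such label, which is useful when you need to distinguish them from inferred statements.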
There is one more type of CC line, which has not yet
been explained with the other CC lines, and that is
the CC block with the Copyright statement:
CC This SWISS-PROT entry is copyright. It is produced through a collaboration
CC between the Swiss Institute of Bioinformatics and the EMBL outstation -
CC the European Bioinformatics Institute. There are no restrictions on its
CC use by non-profit institutions as long as its content is in no way
CC modified and this statement is not removed. Usage by and for commercial
CC entities requires a license agreement
CC or send an email to email@example.com).
Some background information about this very
special type of CC lines:
The enormous growth in the quantity of sequence and
characterisation data has made the task of producing
an annotated and comprehensive protein sequence database
a major challenge. While automation of some aspects
of this work has made it possible to obtain significant
progress in productivity, it nonetheless remains a task
which is intensive in terms of human resources, and
which requires an increasing amount of expertise. Recent
years have shown that public funding for such an activity
is not going to keep pace with its financial requirements.
During the same period, the importance of high quality
annotation for all kinds of life sciences research activities
has grown. We are therefore faced with the paradoxical
situation where no major life sciences research lab
can function without a database such as SWISS-PROT,
yet the existence and continued development of such
a resource is in jeopardy. SWISS-PROT decided that the
only feasible solution to this problem is to obtain
additional funds through the payment of yearly license
fees by non-academic users for access to SWISS-PROT.
The copyright statement should remind commercial users
of their obligation to contribute to the further development
of SWISS-PROT by concluding a license agreement.
The groups in charge of the production of SWISS-PROT
at EMBL and at the Swiss Institute of Bioinformatics
announced in July 1998 that they would request license
fees from commercial users in order to raise revenues,
which would be used entirely to improve SWISS-PROT.
Today, nearly a year later, we are in a position to
take stock: Academic access to SWISS-PROT, and its use
and redistribution, has not been affected, and we are
beginning to see quality improvements resulting from
the extra resources raised. Indeed, even in the commercial
sector, aside from requests for subscriptions to be
paid, nothing has changed in the way that SWISS-PROT
is made available. Companies are showing their appreciation
of the work done in the scientific curation of the scientific
information in SWISS-PROT. The major pharmaceutical
industries have signed, or are in the process of signing,
license agreements. Smaller companies are starting to
The producers of SWISS-PROT would have welcomed a survival
plan for SWISS-PROT funded by public bodies and uncomplicated
by subscriptions. However, Europe was organisationally
unable to come up with the goods. The current pragmatic
expedient to raise revenues has solved the problem for
SWISS-PROT while avoiding commercialisation, and for
that the users of SWISS-PROT are thankful.
But now back to the scientific content of the SWISS-PROT
database. The next section contains the DR (Database cross-Reference) lines:
DR EMBL; X68550; CAA48554.1; -.
DR EMBL; Z15017; CAA78737.1; -.
DR EMBL; X67878; CAA48077.1; -.
DR EMBL; L07414; AAA35662.1; -.
DR EMBL; D31797; BAA06599.1; -.
DR EMBL; D31793; BAA06599.1; JOINED.
DR EMBL; D31794; BAA06599.1; JOINED.
DR EMBL; D31795; BAA06599.1; JOINED.
DR EMBL; D31796; BAA06599.1; JOINED.
DR PIR; S25684; S25684.
DR PIR; S26694; S26694.
DR PIR; S28017; S28017.
DR PIR; S28852; S28852.
DR PIR; JH0793; JH0793.
DR PDB; 1ALY; 17-SEP-97.
DR MIM; 308230; -.
DR PROSITE; PS00251; TNF_1; 1.
DR PROSITE; PS50049; TNF_2; 1.
DR PFAM; PF00229; TNF; 1.
The DR lines link SWISS-PROT to other biomolecular
databases. SWISS-PROT is currently linked to 29 different
databases. In the example above you see links to 19
different entries in six different databases. The cross-references
allow users to navigate to linked databases in order
to retrieve part or all of the related information. The
format of a DR line, except for cross-references to
PROSITE (Hofmann et al., 1999), Pfam (Bateman
et al., 1999), and the EMBL nucleotide sequence
database (Stoesser et al., 1999), is the following:
DR DATABASE_IDENTIFIER; PRIMARY_IDENTIFIER; SECONDARY_IDENTIFIER.
The database identifier is the
name of the database that contains the linked entry.
The primary identifier (in most cases the accession
number) is the entry's primary key, while the secondary
identifier complements the information given by the
first identifier. The currently linked databases are:
Nucleotide sequence database of EMBL (EBI)
Dictyostelium discoideum genome database
Escherichia coli gene-protein database (2D
gel spots) (ECO2DBASE)
Escherichia coli K12 genome database (EcoGene)
Drosophila genome database (FlyBase)
G-protein--coupled receptor database (GCRDb)
HIV sequence database
Harefield hospital 2D gel protein databases
Homology-derived secondary structure of proteins
Maize genome database (MaizeDB)
Maize genome 2D Electrophoresis database (Maize-2DPAGE)
Plant gene nomenclature database (Mendel)
Mouse genome database (MGD)
Mendelian Inheritance in Man Database (MIM)
Brookhaven Protein Data Bank (PDB)
Pfam protein domain database
Protein sequence database of the Protein Information
PROSITE protein domains and families database
Restriction enzyme database (REBASE)
Human keratinocyte 2D gel protein database
from Aarhus and Ghent universities
Saccharomyces Genome Database (SGD)
Salmonella typhimurium LT2 genome database
Bacillus subtilis 168 genome database (SubtiList)
Human 2D Gel Protein Database from the University
of Geneva (SWISS-2DPAGE)
The bacterial database(s) of 'The Institute
of Genome Research' (TIGR)
Transcription factor database (TRANSFAC)
Caenorhabditis elegans genome sequencing project
protein database (WormPep)
Yeast electrophoresis protein database (YEPD)
Zebrafish Information Network genome database
The specific format for cross-references to the EMBL
nucleotide sequence database is:
DR EMBL; ACCESSION_NUMBER; PROTEIN_ID; STATUS_IDENTIFIER.
The secondary identifier is here the 'PROTEIN_ID',
which stands for the 'Protein Sequence Identifier'.
It is a string which is stored, in nucleotide sequence
entries, in a qualifier called '/protein_id' which is
tagged to every CDS in the nucleotide database.
FT CDS 302..2674
FT /product="RecA protein"
The Protein_ID consists of a stable ID portion (8 characters:
3 letters followed by 5 digits) plus a version number
after a decimal point. The version number only changes
when the protein sequence coded by the CDS changes,
while the stable part remains unchanged.
The 'STATUS_IDENTIFIER' provides information about the
relationship between the sequence in the SWISS-PROT
entry and the CDS in the corresponding EMBL entry.
The specific format for cross-references to the PROSITE
and Pfam protein domain and family databases is:
DR PROSITE | PFAM; ACCESSION_NUMBER; ENTRY_NAME; STATUS.
'ACCESSION_NUMBER' stands for the accession number of
the PROSITE or Pfam pattern, profile or HMM entry; 'ENTRY_NAME' is
the name of the entry and 'STATUS' is one of the following:
n, FALSE_NEG, PARTIAL or UNKNOWN.
‘n' is the number of hits of the pattern or profile
in that particular protein sequence. The ‘FALSE_NEG'
status indicates that while the pattern or profile did
not detect the protein sequence, it is a member of that
particular family or domain. The ‘PARTIAL' status indicates
that the pattern or profile did not detect the sequence
because that sequence is not complete and lacks the
region on which the pattern/profile is based. Finally
the 'UNKNOWN' status indicates uncertainty as to
whether the sequence is a member of the family or
domain described by the pattern/profile. Pfam cross-references
do not make use of the 'FALSE_NEG' and 'UNKNOWN' statuses.
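The DR formats described above can be read mechanically. The following is a minimal Python sketch, not the parser used by SWISS-PROT: it assumes that fields are separated by semicolons and that the line ends with a full stop, as in the example lines shown earlier. The names CrossRef, parse_dr and split_protein_id are made up for this illustration.

```python
import re
from typing import NamedTuple, Optional, Tuple


class CrossRef(NamedTuple):
    database: str
    primary: str
    secondary: str
    status: Optional[str]  # fourth field; present on EMBL, PROSITE and PFAM lines


# Stable part (3 letters + 5 digits), a dot, then the version number.
PROTEIN_ID = re.compile(r"^([A-Z]{3}\d{5})\.(\d+)$")


def parse_dr(line: str) -> CrossRef:
    """Split one DR line into its semicolon-separated fields."""
    if not line.startswith("DR"):
        raise ValueError(f"not a DR line: {line!r}")
    # Drop the line code and the terminating full stop (assumed layout).
    body = line[2:].strip().rstrip(".")
    fields = [part.strip() for part in body.split(";")]
    if len(fields) < 3:
        raise ValueError(f"too few fields in DR line: {line!r}")
    status = fields[3] if len(fields) > 3 else None
    return CrossRef(fields[0], fields[1], fields[2], status)


def split_protein_id(protein_id: str) -> Tuple[str, int]:
    """Return (stable_id, version) for a Protein_ID such as 'CAA48554.1'."""
    match = PROTEIN_ID.match(protein_id)
    if match is None:
        raise ValueError(f"not a Protein_ID: {protein_id!r}")
    return match.group(1), int(match.group(2))
```

Applied to the first EMBL line of the example, parse_dr gives the database name, the accession number, the Protein_ID and the status; split_protein_id then separates the stable part of the Protein_ID from its version number.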
After the DR lines you will find the KW (KeyWord) lines,
which list relevant keywords that can be used to retrieve
a specific subset of protein entries from the database:
KW Cytokine; Transmembrane; Glycoprotein;
KW Disease mutation; Polymorphism.
We now arrive at the FT (FeaTure) lines, which describe
regions or sites of interest in the sequence:
FT TRANSMEM 23
(TYPE-II MEMBRANE PROTEIN).
FT DISULFID 178
FT CARBOHYD 240
M -> R (IN H1GM1).
.. 15 FT lines omitted
In general the feature table lists post-translational
modifications, binding sites, active sites of an enzyme,
the secondary structure, sequence conflicts and variations,
signal sequences, transit peptides, propeptides, transmembrane
regions, and other characteristics.
The feature table, like the CC and DE
lines, gives the user an indication of the level of characterisation
of a protein. In the example above only the variants
are experimentally verified. The other features were derived
from sequence similarity searches and prediction programs.
If a feature is highly likely, then the label
'probable' is used. The label 'potential' is also used
to indicate the assignment by comparative analysis.
In our example it is known that this is a glycosylated
type II membrane protein containing disulfide bonds,
but the correct topology of the protein, the glycosylation
site(s) and the disulfide bonds have not been experimentally
confirmed. The label 'potential' is used to indicate
the predicted character of the information given in
the features 'DOMAIN', 'DISULFID', and 'CARBOHYD'. Another
label used to indicate that a feature has not been experimentally
proven but only inferred through sequence analysis is
'by similarity':
FT ACT_SITE 142 142 BY SIMILARITY.
This example again comes from Q10480, the 'PROBABLE
MITOCHONDRIAL NUCLEASE' of Schizosaccharomyces pombe,
which we have already used several times as an example of
a protein without biochemical characterisation. The
label ‘by similarity' indicates that this feature
has been assigned due to similarity to an existing characterised
entry, in this case the mitochondrial nuclease from
Now we are at the end of the in-depth view of a SWISS-PROT
entry and arrive at the SQ (SeQuence header) line and the
SQ SEQUENCE 261 AA; 29273 MW; DC2AD21F CRC32;
MIETYNQTSP RSAATGLPIS MKIFMYLLTV
FLITQMIGSA LFAVYLHRRL DKIEDERNLH
EDFVFMKTIQ RCNTGERSLS LLNCEEIKSQ
FEGFVKDIML NKEETKKENS FEMQKGDQNP
QIAAHVISEA SSKTTSVLQW AEKGYYTMSN
NLVTLENGKQ LTVKRQGLYY IYAQVTFCSN
REASSQAPFI ASLCLKSPGR FERILLRAAN
THSSAKPCGQ QSIHLGGVFE LQPGASVFVN
VTDPSQVSHG TGFTSFGLLK L
Introduction. There is a tremendous increase
in sequence data due to technological advances (such
as sequencing machines), the use of new biochemical
methods (such as PCR technology) as well as the implementation
of projects to sequence complete genomes. These advances
have brought along an enormous flood of sequence information.
Maintaining the high quality of SWISS-PROT requires,
for each entry, a time-consuming process that involves
the extensive use of sequence analysis tools along with
detailed curation steps by expert annotators. It is
the rate-limiting step in the production of the database.
A supplement to SWISS-PROT was created in 1996, since
it is vital to make new sequences available as quickly
as possible without relaxing the high editorial standards
of SWISS-PROT. This supplement, TrEMBL (Translation
of EMBL nucleotide sequence database), consists of computer-annotated
entries derived from the translation of all coding sequences
(CDS) in the EMBL nucleotide sequence database, except
for those already included in SWISS-PROT. TrEMBL is
split into two main sections, SP-TrEMBL and REM-TrEMBL.
SP-TrEMBL (SWISS-PROT TrEMBL) contains the entries
that should eventually be incorporated into SWISS-PROT.
REM-TrEMBL (REMaining TrEMBL) contains the entries that
will not get included in SWISS-PROT. In the following
you will find mainly a description of SP-TrEMBL. Therefore,
unless otherwise specified, the word "TrEMBL"
will stand for SP-TrEMBL in the rest of this chapter.
A typical TrEMBL entry is shown in Figure
2. As you can see, a TrEMBL entry looks very much
like a SWISS-PROT entry, since TrEMBL follows the SWISS-PROT
format and conventions as closely as possible. But there
are a few necessary differences affecting the ID and
It was already explained above that the very first
line of a SWISS-PROT entry is the ID line ('ID' for
identification), which is made of four different parts:
ID ANP_NOTCO STANDARD; PRT; 822 AA.
A TrEMBL 'ID' line is also made of four parts and looks
like this one:
ID Q12757 PRELIMINARY; PRT; 171 AA.
You can see that the SWISS-PROT and TrEMBL ID lines
differ in their first two parts. The first
part is the entry name: 'ANP_NOTCO' in the case of the
SWISS-PROT example and 'Q12757' in the TrEMBL example.
The entry name used in all SP-TrEMBL entries is always
the same as the accession number of the entry. The entry
name used in REM-TrEMBL is the Protein_ID tagged to
the corresponding CDS in the EMBL Nucleotide Sequence
Database. To the right of the entry name you will find
either 'PRELIMINARY' (in the TrEMBL entry) or 'STANDARD' (in
the SWISS-PROT entry). The data class used in TrEMBL
is always 'PRELIMINARY'. That means that what you are
looking at is thoroughly checked by a computer but none
of the biologists curating SWISS-PROT and TrEMBL has
had time yet to read the necessary papers to finalise
There is a last difference between the SWISS-PROT and
TrEMBL entries, which affects the DT line (DaTe). The
syntax and definition of the DT lines that serve to
indicate when an entry was created and updated are identical
to that defined in SWISS-PROT; but the DT lines in TrEMBL
are referring to the TrEMBL release. The difference
is shown in the example below.
DT lines in a SWISS-PROT entry:
DT 01-JAN-1988 (Rel. 06, Created)
DT 01-JUL-1989 (Rel. 11, Last sequence update)
DT 01-AUG-1992 (Rel. 23, Last annotation update)
DT lines in a TrEMBL entry:
DT 01-NOV-1996 (TrEMBLrel. 01, Created)
DT 01-FEB-1997 (TrEMBLrel. 02, Last sequence update)
DT 01-JUN-1998 (TrEMBLrel. 06, Last annotation update)
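The differences in the ID and DT lines are enough to tell a SWISS-PROT entry from a TrEMBL entry by program. The following Python sketch assumes the layout of the example lines above; the regular expressions and the names IdLine, DtLine, parse_id, parse_dt and database_of are illustrative and not part of any official tool.

```python
import re
from typing import NamedTuple


class IdLine(NamedTuple):
    entry_name: str
    data_class: str      # 'STANDARD' or 'PRELIMINARY'
    molecule_type: str   # 'PRT' in the examples
    length: int          # number of amino acids


class DtLine(NamedTuple):
    date: str
    database: str        # which release series the line refers to
    release: int
    event: str           # e.g. 'Created'


ID_PATTERN = re.compile(r"^ID\s+(\S+)\s+([A-Z]+);\s+([A-Z]+);\s+(\d+)\s+AA\.$")
DT_PATTERN = re.compile(
    r"^DT\s+(\d{2}-[A-Z]{3}-\d{4})\s+\((TrEMBLrel|Rel)\.\s+(\d+),\s+(.+)\)$"
)


def parse_id(line: str) -> IdLine:
    match = ID_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"not an ID line: {line!r}")
    name, data_class, molecule, length = match.groups()
    return IdLine(name, data_class, molecule, int(length))


def database_of(id_line: IdLine) -> str:
    """STANDARD marks a SWISS-PROT entry, PRELIMINARY a TrEMBL entry."""
    return {"STANDARD": "SWISS-PROT", "PRELIMINARY": "TrEMBL"}[id_line.data_class]


def parse_dt(line: str) -> DtLine:
    match = DT_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"not a DT line: {line!r}")
    date, prefix, release, event = match.groups()
    database = "TrEMBL" if prefix == "TrEMBLrel" else "SWISS-PROT"
    return DtLine(date, database, int(release), event)
```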
The production of TrEMBL. To understand what
information you can find in TrEMBL, you need to have
some basic understanding of the TrEMBL production procedures.
The production of TrEMBL is illustrated in Figure
3. It starts with the translation of coding sequences
(CDS) in the EMBL nucleotide sequence database. At this
stage all annotation you can find in a TrEMBL entry
comes from the corresponding EMBL entry. At the next
stage, the Post-processing phase, the redundancy in
TrEMBL gets reduced and additional annotation is automatically
added to bring TrEMBL entries closer to SWISS-PROT standard.
All EMBL nucleotide sequence database divisions are
regularly scanned for new or updated CDS features. These
are translated to "TrEMBLnew" entries, which
are in SWISS-PROT format. Each CDS leading to a correct
translation results in one entry whose ID is the Protein_ID
of the CDS. In the next step the original EMBL entries
are scanned to extract relevant data, to filter it and
eventually to insert it properly formatted into the
TrEMBLnew entry. Only bibliographic references relevant
to the given CDS are kept in the TrEMBLnew entry. This
is achieved by scanning the RP (Reference Position)
lines of the EMBL entry and matching with the CDS position
in the sequence. The RC (Reference Comment) line is
built by assigning the SWISS-PROT equivalent of the
following EMBL qualifiers:
- "/isolate","STRAIN=", (2nd choice)
- "/cultivar","STRAIN=CV. "
The description line (DE) comes from the /product qualifier
when present; otherwise the EMBL DE line and the /gene
and /note qualifiers are parsed. The EMBL DE line is
only considered if the EMBL entry contains only one
CDS and is stripped of non-pertinent information such
as the organism name, or phrases like 'complete CDS'.
The /gene qualifier is also used for the TrEMBLnew GN
line. In most cases these procedures lead to some sort
of informative DE line. However, in some cases the information
content of the corresponding EMBL entry is quite low
and the TrEMBLnew entry ends up with DE lines providing
nonsense information like:
DE PUTATIVE START AND STOP CODONS.
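The selection of a DE line described above can be sketched as a simple fallback chain. This Python sketch is only an illustration under stated assumptions: the text does not give the precedence among the EMBL DE line, /gene and /note, so the order used here is a guess; the list of "non-pertinent" phrases is limited to the one example mentioned; and the upper-casing merely mirrors the style of the example DE line. All function names are hypothetical.

```python
import re
from typing import Mapping, Optional

# Phrases treated as non-pertinent in an EMBL DE line (illustrative list only).
NOISE = re.compile(r",?\s*\b(complete|partial) cds\b\.?", re.IGNORECASE)


def clean_embl_de(embl_de: str, organism: str) -> str:
    """Strip the organism name and phrases like 'complete CDS'."""
    text = embl_de.replace(organism, "")
    text = NOISE.sub("", text)
    return " ".join(text.split()).strip(" ,.;")


def derive_description(
    qualifiers: Mapping[str, str],
    embl_de: str,
    organism: str,
    cds_count: int,
) -> Optional[str]:
    """Pick a description for a TrEMBLnew entry (assumed fallback order)."""
    product = qualifiers.get("product")
    if product:
        return product.upper()
    # The EMBL DE line is only considered for entries with a single CDS.
    if cds_count == 1:
        cleaned = clean_embl_de(embl_de, organism)
        if cleaned:
            return cleaned.upper()
    for key in ("gene", "note"):
        value = qualifiers.get(key)
        if value:
            return value.upper()
    return None
```

A rule of this kind can only be as informative as its input, which is why uninformative DE lines such as the one shown above still occur.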
The EMBL keywords are included in the TrEMBLnew
entry, but only when they match a subset of SWISS-PROT
keywords which have the same meaning. Another condition
is that the EMBL entry has just one CDS so that no ambiguity
is possible. Some extra keywords derived from the features
and description lines are added. A subset of SWISS-PROT
features can be derived from the EMBL entry features.
- SIGNAL from sig_peptide
- TRANSIT from transit_peptide
- CHAIN from mat_peptide
- VARIANT from allele, variation, misc_difference
- CONFLICT from conflict
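The mapping listed above is a plain lookup table. A minimal Python sketch follows; the tuple layout (key, start, end) and the function name convert_features are assumptions made for this illustration, and the conversion of nucleotide positions into protein positions, which a real pipeline would need, is deliberately left out.

```python
from typing import Iterable, List, Tuple

# EMBL feature key -> SWISS-PROT feature key, as listed in the text.
EMBL_TO_SWISSPROT_FEATURE = {
    "sig_peptide": "SIGNAL",
    "transit_peptide": "TRANSIT",
    "mat_peptide": "CHAIN",
    "allele": "VARIANT",
    "variation": "VARIANT",
    "misc_difference": "VARIANT",
    "conflict": "CONFLICT",
}

Feature = Tuple[str, int, int]  # (key, start, end); hypothetical layout


def convert_features(embl_features: Iterable[Feature]) -> List[Feature]:
    """Keep only EMBL features with a SWISS-PROT equivalent and rename them."""
    converted = []
    for key, start, end in embl_features:
        swissprot_key = EMBL_TO_SWISSPROT_FEATURE.get(key)
        if swissprot_key is not None:
            converted.append((swissprot_key, start, end))
    return converted
```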
Two examples of TrEMBLnew entries, created in the way
described before, are shown in Figure
4. In addition to this information parsed into
TrEMBLnew entries, data is put in the annotator's section
of the entry, which is not visible to the public. This
is used for further analysis both by programs and by
biologists and consists of:
- The EMBL entry description lines
- EMBL CC lines
- Bibliographic reference titles
- Full CDS feature text
- Full text of other relevant features within the
- Number of CDS in the EMBL entry
- The date of the last entry update
- Information if the organism already exists in SWISS-PROT
At this stage different types of TrEMBLnew entries
are put into different output files:
- CDS with a /dbxref="SWISS-PROT" or a /dbxref="SPTREMBL"
are not translated (already in SWISS-PROT + TrEMBL)
- CDS from mhc genes -> mhc.dat
- CDS from patent data -> patent.dat
- CDS from immunoglobulins and T-cell receptors -> immuno.dat
- CDS smaller than 8 amino acids -> smalls.dat
- CDS from artificial, synthetic or chimeric genes
- CDS from pseudogenes -> pseudo.dat
- remaining CDS -> stay in their relative taxonomic
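The sorting of TrEMBLnew entries into output files can be pictured as a chain of tests. In the Python sketch below, the order of the tests simply follows the order of the list above, which is an assumption; the file names immuno.dat and synthetic.dat are taken from the file list given later in this chapter; and the Cds class with its 'category' field is a hypothetical stand-in for whatever the real pipeline derives from the EMBL entry.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Cds:
    translation: str                  # translated protein sequence
    division: str                     # taxonomic division code, e.g. 'hum'
    dbxrefs: Set[str] = field(default_factory=set)
    category: Optional[str] = None    # 'mhc', 'patent', 'immuno', 'synthetic', 'pseudo'


def output_file(cds: Cds) -> Optional[str]:
    """Return the output file for a CDS, or None if it is not translated."""
    if cds.dbxrefs & {"SWISS-PROT", "SPTREMBL"}:
        return None  # already in SWISS-PROT + TrEMBL
    if cds.category == "mhc":
        return "mhc.dat"
    if cds.category == "patent":
        return "patent.dat"
    if cds.category == "immuno":
        return "immuno.dat"
    if len(cds.translation) < 8:
        return "smalls.dat"
    if cds.category == "synthetic":
        return "synthetic.dat"
    if cds.category == "pseudo":
        return "pseudo.dat"
    # Everything else stays in its taxonomic division file.
    return f"{cds.division}.dat"
```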
Now the entries from the composite divisions of the
EMBL database (HTG, STS, EST, and UNC) are added to
their relative taxonomic TrEMBLnew divisions. Then all
files are searched for entries that have recently been
added to SWISS-PROT or TrEMBL and are thus missing a
/dbxref="SWISS-PROT" or a /dbxref="SPTREMBL"
qualifier in EMBL. These entries are removed. The entries
put in the files patent.dat, immuno.dat, smalls.dat,
synthetic.dat and pseudo.dat are now already at the
end of their production line. They are new entries in
REM-TrEMBL (REMaining TrEMBL), which contains the entries
(about 44,000 in release 10) that will not get included
in SWISS-PROT. This section is organised in five subsections:
- Immunoglobulins and T-cell receptors (file name
Immuno.dat): Most REM-TrEMBL entries are immunoglobulins
and T-cell receptors. The integration of further immunoglobulins
and T-cell receptors into SWISS-PROT has been stopped,
since SWISS-PROT does not want to add all known somatically
recombined variants of these proteins to the database.
At the moment there are more than 18,000 immunoglobulins
and T-cell receptors in REM-TrEMBL. SWISS-PROT plans
to create a specialised database dealing with these
sequences as a further supplement to SWISS-PROT but
will keep only a representative cross-section of these
proteins in SWISS-PROT.
- Synthetic sequences (file name Synth.dat): Another
category of data which will not be included in SWISS-PROT
are synthetic sequences.
- Small fragments (file name Smalls.dat): A subsection
containing protein fragments of fewer than eight amino acids.
- Patent application sequences (file name Patent.dat):
Coding sequences captured from patent applications.
A thorough survey of these entries has shown that,
apart from a small minority (which have already been
integrated in SWISS-PROT), most of these sequences
contain either erroneous data or concern artificially
generated sequences outside the scope of SWISS-PROT.
- CDS not coding for real proteins (file name Pseudo.dat):
The last subsection consists of CDS translations which
are most probably not coding for real proteins.
The remaining 14 TrEMBLnew files (arc.dat, fun.dat,
inv.dat, hum.dat, mam.dat, mhc.dat, org.dat, phg.dat,
pln.dat, pro.dat, rod.dat, unc.dat, vrl.dat and vrt.dat)
will undergo further post-processing. These steps
add considerable value to the TrEMBL data. Up to this
stage the annotation of the TrEMBL entries reflects
the status of the annotation of the CDS features in
the EMBL nucleotide sequence database. Whenever a submitter
to the DDBJ/EMBL/GenBank nucleotide sequence databases
provided insufficient or wrong annotation, most of this
erroneous information will be parsed into the TrEMBL
entries, although there are already a lot of filters
in place to get rid of the most frequently occurring
The first post-processing step is the reduction of
redundancy (O'Donovan et al., 1999). One of SWISS-PROT's
leading concepts from the very beginning was to minimise
the redundancy of the database by merging separate entries
corresponding to different literature reports. If conflicts
exist between various sequencing reports, they are indicated
in the feature table of the corresponding entry. This
stringent requirement of minimal redundancy applies
equally to SWISS-PROT + TrEMBL. However, it will still
take some time before TrEMBL has the same low level
of redundancy as SWISS-PROT. TrEMBL is partially redundant
against SWISS-PROT and against itself since a significant
percentage of the entries are actually additional reports
of proteins already present in SWISS-PROT + TrEMBL.
There are two different kinds of redundancy, which are
commonplace in many sequence databases:
- Different literature and sequence reports of a given
- Mutations, polymorphism, variations in the sequence
that are often given separate entries in the nucleotide
These redundancies should not be present in SWISS-PROT
or TrEMBL; and thus it was necessary to find methods
to manipulate the data from redundant source databases
to meet the stringent standards of minimal redundancy.
The objective was to recognise and eliminate the redundancy
already present in the databases, and to prevent further
redundancy entering the database.
A very fast and efficient method, which allows the
identification of thousands of TrEMBL entries exactly
matching SWISS-PROT or TrEMBL entries, is the use of
the CRC32 checksum. The Cyclic Redundancy Check (CRC)
calculates a nearly unique and very compact checksum
for each sequence and this allows fast and accurate
detection of identical sequences. At every TrEMBL release,
a CRC32 check is carried out to identify identical sequences
in TrEMBL and SWISS-PROT. A curator then merges these
entries manually. There is also a CRC32 check of TrEMBLnew
(the weekly TrEMBL updates) against TrEMBL and SWISS-PROT.
TrEMBLnew entries that match SWISS-PROT entries
are collated for annotation by curators. TrEMBLnew entries
that match TrEMBL entries are merged into one entry
automatically, with the following exceptions:
- Viral protein fragments
- Cross-species protein fragments
- MHC fragments
- Plasmodium merozoites surface antigen fragments
- Outer membrane protein fragments
- Fusion protein fragments
- Homeobox or Homeodomain protein fragments
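The idea behind the checksum comparison can be illustrated in a few lines of Python. This is a sketch only: it uses the standard CRC-32 from Python's zlib module, and it has not been checked that this variant reproduces the checksum values printed in SQ lines. Because the checksum is only "nearly unique", the sketch confirms each candidate group by comparing the sequences themselves. The accession numbers in the usage example are invented.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Mapping


def crc32_hex(sequence: str) -> str:
    """Eight-digit hexadecimal CRC-32 of a protein sequence."""
    return format(zlib.crc32(sequence.encode("ascii")) & 0xFFFFFFFF, "08X")


def identical_groups(entries: Mapping[str, str]) -> List[List[str]]:
    """Group accession numbers whose sequences are identical.

    entries maps accession number -> sequence.
    """
    buckets: Dict[str, List[str]] = defaultdict(list)
    for accession, sequence in entries.items():
        buckets[crc32_hex(sequence)].append(accession)

    groups = []
    for accessions in buckets.values():
        if len(accessions) < 2:
            continue
        # Same checksum is not proof of identity: compare the sequences too.
        by_sequence: Dict[str, List[str]] = defaultdict(list)
        for accession in accessions:
            by_sequence[entries[accession]].append(accession)
        groups.extend(sorted(g) for g in by_sequence.values() if len(g) > 1)
    return sorted(groups)
```

For example, identical_groups({"A": "MKT", "B": "MKT", "C": "MKV"}) puts A and B into one group and leaves C alone. The exceptions listed above would have to be filtered out before any automatic merge.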
The redundancy removal based on the CRC32 matching
eliminated the most obvious redundancy from TrEMBL.
However, there are still tens of thousands of cases of
potential (not easily detectable) redundancy, which
need to be eliminated:
- Exact matches of fragments (a TrEMBL entry is a
fragment of a SWISS-PROT entry or vice-versa; or a
TrEMBL entry is a fragment of another TrEMBL entry).
- SWISS-PROT and TrEMBL protein entries from the same
organism which should be identical but differ due
to sequencing errors, variants, frameshifts etc.
The next step in reducing redundancy was to merge exact
subfragments into longer sequences by using LASSAP
(Large Scale Sequence compArison Package), a software
package developed by Glemet and Codani (1997) at INRIA
in France. LASSAP has been modified specifically to
identify redundancy in SWISS-PROT and TrEMBL. The subfragment
discovery and removal is an integral part of the TrEMBL
production process at each release in order to check
for such subfragment redundancy within TrEMBLnew itself
and then between TrEMBLnew, TrEMBL and SWISS-PROT.
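What "exact subfragment" means can be shown with a naive Python sketch. This is not LASSAP and says nothing about how LASSAP works internally: it is a simple quadratic substring search that would be far too slow for hundreds of thousands of entries, and it is given only to make the criterion concrete. The accession numbers in the usage example are invented.

```python
from typing import List, Mapping, Tuple


def find_subfragments(entries: Mapping[str, str]) -> List[Tuple[str, str]]:
    """Return (fragment, container) accession pairs.

    A pair is reported when the fragment's sequence occurs exactly
    inside a strictly longer sequence. entries maps accession -> sequence.
    """
    by_length = sorted(entries.items(), key=lambda item: len(item[1]))
    pairs = []
    for index, (fragment_acc, fragment_seq) in enumerate(by_length):
        # Only sequences later in the list can be longer.
        for container_acc, container_seq in by_length[index + 1:]:
            if len(container_seq) > len(fragment_seq) and fragment_seq in container_seq:
                pairs.append((fragment_acc, container_acc))
    return pairs
```

With entries {"P1": "MKTAYIAKQR", "T1": "TAYIA", "T2": "AKQR", "T3": "GGGG"}, both T1 and T2 are reported as subfragments of P1, while T3 is not.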
To give you some indication of the scope of the task,
let us have a look at TrEMBL release 10, which consisted
of 244,862 entries. TrEMBL 10 was supplementing SWISS-PROT
release 38 (around 80,000 entries) and was produced
from EMBL release 58, which contained 384,000 CDS. 120,000
of these 384,000 CDS were already present as sequence
reports in SWISS-PROT and were excluded from the TrEMBL
production process. The remaining 264,000 CDS were merged
whenever possible as described above and the final result
was 244,862 entries. This removal of some 19,000
entries clearly shows the value of the redundancy procedures
that have already been developed and implemented. Figure
5 shows an example of an automatically merged TrEMBL
entry, created by merging the two TrEMBL entries
shown in Figure
TrEMBL is still partially redundant against SWISS-PROT
since approximately 40,000 of these entries are actually
additional sequence reports of proteins already in SWISS-PROT.
This remaining redundancy is more difficult to eliminate
since the protein entries, which should be merged, differ
due to sequencing errors, variants, frameshifts etc.
Although the merging operations are automated, all merged
entries are finally checked by biologists to avoid the
merging of sequences from two different but highly similar
genes into one entry. These time-consuming checks are
the reason why it will still take some time before SWISS-PROT
+ TrEMBL will have the same low level of redundancy
as SWISS-PROT. The biologists working on the curation
of SWISS-PROT and TrEMBL are sifting through the entries
where two or more teams report what should be an identical
sequence and their sequences differ by one residue or
more. In all these cases the curators need to decide
whether these conflicting reports are really reports
of the same gene. If they are sure that these reports
should be merged, they need to find out the nature of
the conflict: Are the differences due to strain differences,
or alleles and polymorphisms, or due to disease-causing
mutations or the product of alternative splicing? Or
has a site been experimentally altered? Or are some
of the differences only plain sequencing errors? The
answers to these questions influence the way to annotate
The second post-processing step is the automated enhancement
of the TrEMBL annotation to bring TrEMBL entries closer
to SWISS-PROT standard. There is an increasing need
for reliable automatic functional annotation to cope
with the rapidly increasing amount of sequence data.
Most of the current approaches are still based on sequence
similarity searches against known proteins. Some groups
try to collect the results of different prediction tools
in a simple way, e.g. PEDANT (Frishman and Mewes, 1997)
or GeneQuiz (Scharf et al., 1994). However, several
pitfalls of these methods have been reported (Bork and
A single sentence describing some properties of the
unknown protein is not regarded as optimal automatic
annotation of TrEMBL. What is required, as in SWISS-PROT, is
as much information as possible about properties like
function(s) of the protein, domains and sites, catalytic
activity, cofactors, regulation, induction, pathways,
tissue specificity, developmental stages, subcellular
To enhance the annotation of TrEMBL, a novel method
for the prediction of this information has been developed
(Fleischmann et al., 1999). The principle is
very simple: The method tries to find SWISS-PROT entries
belonging to the same protein family as the unannotated
TrEMBL entry, extracts the annotation shared by all
SWISS-PROT entries, assigns this common annotation to
the unannotated TrEMBL entry, and flags this annotation
as annotated by similarity. The whole procedure starts
with the scanning of all TrEMBL entries for PROSITE
patterns. If a matching pattern is found, a three-step
procedure is used to reduce the number of false positive
hits. Firstly, the taxonomic classification of the TrEMBL
entry must be within the known taxonomic range of the
PROSITE pattern. For instance, a match of an a-priori
prokaryotic pattern against a human protein is regarded
as false positive and filtered out.
Secondly, the significance of the PROSITE pattern match
is checked. This is done by a second check of the TrEMBL
sequence with a set of secondary patterns derived from
the PROSITE pattern. These secondary patterns are computed
with the eMotif algorithm (Nevill-Manning et al.,
1997). The PROSITE database contains a list of all SWISS-PROT
proteins that are true members of the relevant protein
family. For each pattern, the true positive sequences
are aligned and fed into eMotif, which computes a nearly
optimal set of regular expressions, based on statistical
rather than biological evidence. A stringency of 10^-9
is used, so that each eMotif pattern is expected to
produce, by chance, one false positive hit in 10^9 matches.
Thirdly, in cases where a protein family is characterised
by more than one PROSITE signature, all signatures must
be found in the entry. For instance, bacterial rhodopsins
have a signature for a conserved region in helix C and
another signature for the retinal binding lysine. If
a TrEMBL entry matches only the helix-C-pattern, but
not the retinal-binding pattern, it will not be regarded
as a bacterial rhodopsin.
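The three confirmation steps can be combined into a single test. The Python sketch below is an illustration under several assumptions: taxonomy is represented as a set of lineage names, the eMotif check is passed in as a caller-supplied function because its computation is not reproduced here, and the signature names used in the example are placeholders rather than real PROSITE identifiers.

```python
from typing import Callable, Set


def confirm_family(
    lineage: Set[str],
    hits: Set[str],
    family_signatures: Set[str],
    taxonomic_range: Set[str],
    passes_secondary: Callable[[str], bool],
) -> bool:
    """Return True if an entry passes all three confirmation steps.

    lineage           -- taxonomic names of the entry's organism
    hits              -- signatures that matched the entry's sequence
    family_signatures -- all signatures characterising the family
    taxonomic_range   -- taxa in which the family is known to occur
    passes_secondary  -- stand-in for the secondary-pattern check
    """
    # Step 1: the entry must lie within the known taxonomic range.
    if not (lineage & taxonomic_range):
        return False
    # Step 2: keep only hits confirmed by the secondary patterns.
    confirmed = {
        signature
        for signature in hits
        if signature in family_signatures and passes_secondary(signature)
    }
    # Step 3: every signature of the family must be present.
    return bool(family_signatures) and confirmed == family_signatures
```

In the bacterial rhodopsin example, an entry that matches only the helix C signature fails step 3, whatever the outcome of the first two steps.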
The raw PROSITE hits and all results of the confirmation
steps are stored in a hidden section of the TrEMBL entry,
but only those hits that satisfy all confirmation conditions
are made publicly visible in a 'DR PROSITE' line.
PROSITE signatures can characterise approximately 35%
of all TrEMBL entries, but only around 30% of all TrEMBL
entries are true positive matches. The characterisation
based only on PROSITE would lead to 10-20% of false
positive assignments. The confirmation steps reduce
the level of characterisation by nearly a third to 25%.
At this stage, we achieve a level of less than 0.07%
of false positive assignments.
Whenever a TrEMBL entry is recognised by these procedures
as a true member of a certain protein family, annotation
about the potential function, active sites, cofactors,
binding sites, domains, subcellular locations is added
to the entry. The main source of the annotation is compiled
by extracting the annotation that is common to all SWISS-PROT
entries of the relevant protein family. Other sources
include manual descriptions of protein families and
translations of trustworthy description libraries into
SWISS-PROT wording. For example, there is a '/SITE=9,heme_iron'
description for the cytochrome_b_heme pattern in PROSITE.
This is translated to the correct SWISS-PROT syntax:
FT METAL nn nn IRON (HEME AXIAL LIGAND)(BY
In other words, for every protein family, a "virtual
SWISS-PROT entry" is created computationally, which
is based on the specific annotation valid for all SWISS-PROT
members of this family. If a new TrEMBL protein belongs
to a certain family, the annotation of the virtual entry
for this family is immediately transferred to this TrEMBL
The "virtual SWISS-PROT entries" have a far-reaching
effect on TrEMBL. For example, the virtual entry for
the Rubisco large chain affects 3300 TrEMBL entries.
Therefore a system has been developed to decompose these
virtual entries into rules, which are stored in a relational
database with proper version control features.
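The core of a "virtual SWISS-PROT entry" is the annotation shared by all members of a family. The Python sketch below models annotation as sets of plain strings, which is a strong simplification of real SWISS-PROT lines; the function names and the ' (BY SIMILARITY)' suffix are chosen for illustration only and do not describe the rule format stored in the relational database.

```python
from typing import Iterable, Set


def virtual_entry(family_annotations: Iterable[Set[str]]) -> Set[str]:
    """Annotation items common to every SWISS-PROT member of a family."""
    member_sets = [set(annotation) for annotation in family_annotations]
    if not member_sets:
        return set()
    return set.intersection(*member_sets)


def transfer_annotation(trembl_annotation: Set[str], common: Set[str]) -> Set[str]:
    """Add the common annotation to a TrEMBL entry, flagged as inferred."""
    flagged = {
        f"{item} (BY SIMILARITY)"
        for item in common
        if item not in trembl_annotation
    }
    return set(trembl_annotation) | flagged
```

If one family member carries the items 'KEYWORD: Cytokine', 'SUBUNIT: Homotrimer' and 'KEYWORD: Glycoprotein' and another carries only the first two, the virtual entry consists of those two shared items, and only they are transferred to a new TrEMBL member of the family.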
This rule-based system allows expressing the membership
criteria for each protein family in a formal language.
Furthermore, subfamilies have been introduced to meet
the SWISS-PROT standard more closely. For example, the
ribosomal protein L1 family contains eukaryotes as well
as prokaryotes. But the annotation added to TrEMBL entries
of this family obviously depends on the taxonomic kingdom.
The description reads '50S RIBOSOMAL PROTEIN L1' for
prokaryotes, archaebacteria, chloroplasts, and cyanelles,
and '60S RIBOSOMAL PROTEIN L10A' for nuclear encoded
proteins of eukaryotes.
The ENZYME database (Bairoch, 1996) is also used to
generate standardised description lines for enzyme entries
and to allow information such as catalytic activity,
cofactors and relevant keywords to be taken from ENZYME
and to be added automatically to TrEMBL entries. Additionally,
specialised databases like FlyBase (FlyBase Consortium,
1999) and MGD (Blake et al., 1999) are used to
transfer information like the correct gene nomenclature
and cross-references to these databases into TrEMBL
entries. The automatic analysis and annotation of TrEMBL
entries is redone and updated every TrEMBL release.
The now fully post-processed TrEMBL entry, already
used as an example before, is shown in Figure
6. Although this computer-generated annotation is
already enhancing the information about the sequence
drastically, it is still a long way to the quality of
the corresponding SWISS-PROT entry (shown in Figure
7), fully annotated by biologists.
InterPro and EDITtoTrEMBL. Currently around
20% of the TrEMBL entries get additional annotation
in the way described above. There are two main reasons
for this low coverage:
- To avoid overprediction very stringent criteria
have been used.
- So far rules have been created for only a quarter
of all PROSITE families.
Higher coverage can easily be achieved if
more patterns and improved conditions are used. The
procedures have been found to be stable and reliable;
it is therefore planned to add more rules to the RuleBase.
The patterns and conditions will be based on the characterisation
of SWISS-PROT and TrEMBL entries by InterPro, the Integrated
Resource of Protein Domains and Functional Sites, a
joint initiative of the databases PROSITE (Hofmann
et al., 1999), Pfam (Bateman et al., 1999),
PRINTS (Attwood et al., 1999), ProDom (Corpet
et al., 1999), and SWISS-PROT + TrEMBL (Bairoch
and Apweiler, 1999). InterPro will serve as a common
co-ordinate system, harmonising domain definitions,
nomenclature, annotation, match-lists and hyperlinks,
while the participating databases will maintain their
individual approaches with all the known benefits. Up
to now, it was difficult to compare hits to the different
databases, as they are based on different protein database
versions. This synchronisation problem has been solved.
The above mentioned motif databases will continue with
their release schedules, while the time between releases
is covered by the EBI on a weekly basis. InterPro entries
contain links to the motif databases, a general description,
method specific descriptions, references, and a list
of matched proteins. Every entry is classified as describing
a protein family, a domain, or a post-translational
modification site. A more detailed description of InterPro
is given in the chapter about secondary protein sequence databases.
The addition of InterPro based rules to the RuleBase
is of huge importance, since the RuleBase is a central
component of EDITtoTrEMBL (Environment for Distributed
Information Transfer to TrEMBL), which was used for
the first time in August 1998 for the production of
TrEMBL release 7 (Möller et al., 1999). EDITtoTrEMBL
aims to provide a stable framework where different analysing
programs can be integrated in a plug-and-play manner.
Not only is the amount of data increasing rapidly, but
the number of analysing programs that predict functional
properties of proteins is also rising constantly.
EDITtoTrEMBL executes analysing programs, which are
controlled by conditions that must be fulfilled to make
their application meaningful. These conditions are stored
in the RuleBase. EDITtoTrEMBL is implemented in Java
and facilitates communication between programs using
Remote Method Invocation. Figure 8
depicts the flow of data inside the framework.
Databases and applications are used as potential sources
of protein annotation. Although there is a certain difference
between these two methods, since databases are queried
while applications are started, the system does not
distinguish between them. In both cases it is necessary
to provide so-called wrappers written in Java to support
the physical distribution of annotation processes. These
wrappers solve three tasks:
- Reformatting of a TrEMBL entry to a valid input
for a program or a query. For programs, this is usually
easy since most programs either accept TrEMBL entries
directly or use FASTA format. For queries, the wrapper
extracts certain parts of the TrEMBL entry, which
are then sent to the database.
- Each wrapper chooses an optimal setting of parameters
for each individual entry.
- To ensure consistency with the controlled vocabulary
of SWISS-PROT, the program output is transformed according
to the manually curated set of rules in the RuleBase.
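The first of these tasks, reformatting an entry into valid program input, can be illustrated with a minimal Python sketch. It is not EDITtoTrEMBL code (the framework itself is written in Java). The function name `entry_to_fasta` and the example entry are invented; the sketch only assumes the SWISS-PROT flat-file conventions of two-letter line codes (ID, AC, DE, SQ), indented sequence lines and a `//` terminator.

```python
def entry_to_fasta(entry_text, width=60):
    """Convert one SWISS-PROT-format entry into a FASTA record."""
    entry_id, accession, description = "", "", []
    sequence, in_sequence = [], False
    for line in entry_text.splitlines():
        if line.startswith("//"):            # end-of-entry marker
            break
        if in_sequence:                      # sequence lines follow the SQ line
            sequence.append(line.replace(" ", ""))
        elif line.startswith("ID"):
            entry_id = line.split()[1]
        elif line.startswith("AC") and not accession:
            accession = line[2:].split(";")[0].strip()   # primary accession
        elif line.startswith("DE"):
            description.append(line[2:].strip())
        elif line.startswith("SQ"):
            in_sequence = True
    seq = "".join(sequence)
    # FASTA allows only a single description line per entry.
    header = ">" + accession + "|" + entry_id + " " + " ".join(description)
    body = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([header] + body)


# Made-up entry, reduced to the line types the sketch reads.
example = """ID   EXAMPLE_HUMAN   PRELIMINARY;   PRT;    12 AA.
AC   Q00000;
DE   HYPOTHETICAL EXAMPLE PROTEIN.
SQ   SEQUENCE   12 AA;
     MKTAYIAKQR QI
//"""
print(entry_to_fasta(example))
```

Everything outside the ID, AC, DE and sequence lines is dropped, so the output is suitable as input for a search program but not as a replacement for the entry itself.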
The unit of a wrapper with its associated program or
database query is called an analyser. Analysers are
often highly specific. The correctness of their results
depends partially on certain conditions, such as the
taxonomic specification. Annotation added by an analyser
is often in turn exploited by other analysers executed
later. EDITtoTrEMBL uses the conditions, which are stored
in the RuleBase, for the execution of analysers. Dispatchers,
programs that coordinate the flow of entries between
different analysers, evaluate these conditions.
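How a dispatcher might use such conditions can be sketched in a few lines of Python. This is an illustration of the principle only, not the EDITtoTrEMBL implementation: the class `Analyser`, the function `dispatch`, the dictionary layout of an entry and the two example analysers are all hypothetical, and the biological rule they encode is deliberately simplistic.

```python
class Analyser:
    """A wrapper plus its program or query, guarded by a condition."""

    def __init__(self, name, condition, run):
        self.name = name
        self.condition = condition   # entry -> bool; stands in for a RuleBase condition
        self.run = run               # entry -> list of new annotation strings


def dispatch(entry, analysers):
    """Apply every analyser whose condition holds; repeat until nothing new fires."""
    applied = []
    progress = True
    while progress:
        progress = False
        for analyser in analysers:
            if analyser.name in applied:
                continue
            if analyser.condition(entry):
                entry["annotations"].extend(analyser.run(entry))
                applied.append(analyser.name)
                progress = True
    return applied


# Hypothetical analysers: the second exploits annotation added by the first.
signal = Analyser(
    "signal_peptide",
    lambda e: "Eukaryota" in e["taxonomy"],
    lambda e: ["SIGNAL"],
)
secreted = Analyser(
    "secreted",
    lambda e: "SIGNAL" in e["annotations"],
    lambda e: ["SUBCELLULAR LOCATION: Secreted (predicted)"],
)

entry = {"taxonomy": ["Eukaryota", "Metazoa"], "annotations": []}
order = dispatch(entry, [secreted, signal])
print(order)
```

The second analyser is listed first but runs last, because its condition only becomes true once the first analyser has added its annotation. An entry with a different taxonomic specification would trigger neither.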
SWISS-PROT + TrEMBL,
a complete and non-redundant view on the protein world?
This section will focus on the use of SWISS-PROT
+ TrEMBL for sequence similarity searches. Searches
in protein sequence databases have now become a standard
research tool in the life sciences. To produce valuable
results, the source databases should be comprehensive,
non-redundant, well annotated and up-to-date. However,
the lack of a single protein sequence database satisfying
all four criteria has previously forced users to perform
searches across multiple databases to avoid incomplete
results. This strategy normally produces complete, but
redundant results due to different versions of the same
sequence report in different databases.
To improve this unsatisfying situation, many bioinformatics
sites construct non-redundant databases from a number
of component databases, or they use external non-redundant
databases, e.g. OWL (Bleasby et al., 1994). Both
strategies improve the situation for the end user considerably,
but they require the time- and resource-consuming maintenance
of multiple databases or the acceptance of a certain
time lag between creation of an entry and its appearance
in the non-redundant database. Furthermore, both strategies
lead to a loss of information in the individual entry
due to the diversity of database formats. While OWL
preserves most information of an entry and some of its
structure, the NRDB program requires a conversion of
the component databases to FASTA format, which contains
only one description line per entry.
SP_TR_NRDB (or abbreviated SPTR) was created to overcome
these limitations. SPTR provides a comprehensive, non-redundant
and up-to-date protein sequence database with a high
information content. The components are:
- The weekly updated SWISS-PROT work release. It contains
the last SWISS-PROT release as well as the new or updated entries.
- The weekly updated SP-TrEMBL work release. REM-TrEMBL
is not included in SP_TR_NRDB, since REM-TrEMBL contains
the entries that will not be included into SWISS-PROT,
e.g. synthetic sequences and pseudogenes.
- TrEMBLnew, the weekly updates to TrEMBL.
During the weekly SP_TR_NRDB building process, all
three components undergo a syntax error check and a
redundancy check. Entries that are filtered out during
the error check or the redundancy check are manually
updated and reintegrated in the next weekly SPTR release.
In the interest of regular updates the SPTR production
is not delayed until the erroneous entries have been
corrected. This introduces a minimal incompleteness
in SPTR, but the current average of five extracted entries
or 0.002% of all entries per weekly release is regarded
The redundancy check used during the weekly SPTR production
ensures non-redundancy on the level of accession numbers,
IDs, and Protein_IDs. Entries with sequence similarity
are not merged into single entries at this stage because
this would also merge entries that should be kept
separate, e.g. fragments of different viral strains.
When building the quarterly major releases of the component
databases, LASSAP enables the identification of entries
that are candidates for merging. The TrEMBL redundancy
removal procedures have already been described in detail
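A check of this identifier-level kind can be sketched in Python as follows. The data layout (dictionaries with `id`, `accessions` and `protein_ids`), the function name and all identifiers are assumptions made for illustration, as is the choice to set aside the later of two clashing entries; the production procedure is not described here in enough detail to reproduce it.

```python
def redundancy_check(entries):
    """Split entries into those kept and those set aside for manual review."""
    seen = {"id": set(), "accessions": set(), "protein_ids": set()}
    kept, set_aside = [], []
    for entry in entries:
        keys = {
            "id": {entry["id"]},
            "accessions": set(entry["accessions"]),
            "protein_ids": set(entry["protein_ids"]),
        }
        # An entry clashes if it reuses any identifier already seen.
        clash = any(keys[field] & seen[field] for field in seen)
        if clash:
            set_aside.append(entry)
        else:
            kept.append(entry)
            for field in seen:
                seen[field] |= keys[field]
    return kept, set_aside


# Made-up entries for illustration.
entries = [
    {"id": "EXMP_HUMAN", "accessions": ["P00000"], "protein_ids": ["g100"],
     "sequence": "MKTAYIAK"},
    # Same sequence as above but different identifiers: kept, not merged.
    {"id": "Q00001", "accessions": ["Q00001"], "protein_ids": ["g101"],
     "sequence": "MKTAYIAK"},
    # Repeats Protein_ID g100: set aside for manual checking.
    {"id": "Q00002", "accessions": ["Q00002"], "protein_ids": ["g100"],
     "sequence": "MLLV"},
]
kept, set_aside = redundancy_check(entries)
print([e["id"] for e in kept], [e["id"] for e in set_aside])
```

Note that the sequences are never compared: two entries with identical sequences but distinct identifiers both survive, which mirrors the decision not to merge by sequence similarity at this stage.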
Various verification steps have been introduced to
ensure that SPTR is comprehensive and contains all relevant
data sources. The main source of new protein sequences
is the translations of CDS in the nucleotide sequence
databases. The up-to-date inclusion of new protein sequence
entries is ensured by the weekly translation of EMBL-NEW
(the updates to the EMBL nucleotide sequence database).
The three collaborating nucleotide sequence databases
DDBJ, EMBL and GenBank exchange their data on a daily
basis. Therefore any protein coding sequence submitted
to DDBJ/EMBL/GenBank will appear in SPTR within two
weeks in the worst case and within less than one week
in the average case.
A further major source consists of amino acid sequences directly
derived from protein sequencing. Thousands of such sequences
have been detected by the SWISS-PROT curators in publications
(or have been directly submitted by researchers to SWISS-PROT)
and entered into the database. Protein sequences detected
by the NCBI journal scan have also been included. For
some proteins the Brookhaven Protein Data Bank (PDB)
(Abola et al., 1996) is the only source for the
sequence information. The PDB entries are regularly
checked and new SWISS-PROT entries get created whenever
The only additional publicly available protein sequence
data, which might not be included in SPTR, are sequences
that have been overlooked by SWISS-PROT, but have been
detected by PIR. Detailed checks of this data source
have been made, to be sure that SPTR contains all publicly
available naturally occurring proteins. As a first step
it was checked which PIR entries have been cross-referenced
by SPTR entries. These entries have been marked as matched
because the cross-references to PIR are manually added
to SPTR entries and refer to directly corresponding
entries. Then the entries containing a PID have been
marked as matched because all these entries are contained
in SPTR as manually curated SWISS-PROT entries or as
EMBL translations in TrEMBL/TrEMBLnew. Finally, full-length
sequence matches and matches of PIR fragments against
longer SPTR and REM-TrEMBL entries have been marked.
The remaining PIR entries (around 10% of PIR) have been
manually checked. In the majority of cases, these entries
were different (redundant) reports for the same sequence
and already included in SPTR in entries with merged
sequence reports. In the cases where the entries were
really missing in SPTR (around 3% of PIR entries), the
SWISS-PROT curators went back to the original publication
and created new SWISS-PROT entries from it
to complete SPTR. These checking procedures
are carried out continuously, so that SPTR offers a comprehensive
view of the protein sequence world. The only protein
sequences not contained in SPTR are the sequences from
the REM-TrEMBL entries, since REM-TrEMBL contains the
entries that will not be included into SWISS-PROT, e.g.
synthetic sequences and pseudogenes, but these remain
available in the REM-TrEMBL distribution.
SPTR has been produced weekly since its start in January
1998. As of 10.9.1999, SPTR contained 352,393 entries:
80,681 SWISS-PROT entries, 198,791 TrEMBL entries and
72,921 TrEMBLnew entries. As the rate of incoming data
and the addition of value through manual curation and
automatic annotation increase, it is planned to start
producing SPTR daily in the near future.
SPTR is distributed in three files: sprot.dat.Z, trembl.dat.Z
and trembl_new.dat.Z. These files are, as indicated
by their "Z" extension, Unix "compress"
format files which, when decompressed, will produce
ASCII files in SWISS-PROT format. Three other files
are also available (sprot.fas.Z, trembl.fas.Z and trembl_new.fas.Z),
which are compressed "fasta" format sequence
files useful for building the databases used by FASTA,
BLAST and other sequence similarity search programs.
Please do not use these files for other purposes, as
you lose all annotation by using this format. The SPTR
files are stored in the directory "/pub/databases/sp_tr_nrdb"
on the EBI FTP server (ftp.ebi.ac.uk) and in the directory
"/databases/sp_tr_nrdb" on the ExPASy FTP
Please note that
- the SWISS-PROT file continuously grows as new annotated
sequences are added.
- the TrEMBL file decreases in size as sequences are
moved out of that section after being annotated and
moved into SWISS-PROT. Four times a year a new release
of TrEMBL is built at EBI and at this point the TrEMBL
file increases in size as it then includes all of
the new data that has accumulated since the last release.
- the TrEMBLnew file starts as a small file and grows
in size until a new release of TrEMBL is available.
You will not find any primary accession number duplicated
between SWISS-PROT and TrEMBL, since they are sharing
the same system of accession numbers. A TrEMBL entry
(and its associated accession number(s)) can either
move to SWISS-PROT as a new entry or be merged with
an existing SWISS-PROT or TrEMBL entry. In the latter
case, the accession number(s) of that TrEMBL entry are
added to those of the SWISS-PROT entry.
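The bookkeeping in the merging case can be sketched in Python. Entries are modelled here as dictionaries with an ordered `accessions` list whose first element is the primary accession number; this representation, the function name and the accession numbers are assumptions made for illustration.

```python
def merge_entries(target, merged):
    """Merge `merged` into `target`, keeping the primary accession of `target`."""
    for accession in merged["accessions"]:
        if accession not in target["accessions"]:
            target["accessions"].append(accession)   # becomes a secondary accession
    return target


# Made-up entries for illustration.
swissprot_entry = {"id": "EXMP_HUMAN", "accessions": ["P00000"]}
trembl_entry = {"id": "Q00001", "accessions": ["Q00001", "O00002"]}

merge_entries(swissprot_entry, trembl_entry)
print(swissprot_entry["accessions"])
```

The primary accession number of the surviving entry stays in first position, while the numbers of the merged entry remain attached to it and can still be used to find the sequence.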
Specialised protein sequence databases
And now a few words about specialised protein
sequence databases. There are many of them, some of
them are quite small and only contain a handful of entries,
and others are wider in scope and larger in size. This
chapter will finish with a brief description of three
representative examples of specialised protein sequence
databases. As this category of databases is quite changeable,
any list provided here would soon be outdated. However,
under the URL http://www.expasy.ch/alinks.html#Proteins you
will find a WWW document that lists information sources
for molecular biologists, which is kept constantly up-to-date.
MEROPS. The MEROPS database (Rawlings and Barrett,
1999) provides a catalogue and structure-based classification
of peptidases (i.e. all proteolytic enzymes). An index
of the peptidases by name or synonym gives access to
a set of files termed PepCards, each of which provides
information on a single peptidase. Each card file contains
information on classification and nomenclature, and
hypertext links to the relevant entries in other databases.
The peptidases are classified into families on the basis
of statistically significant similarities between the
protein sequences in the part termed the `peptidase
unit' that is most directly responsible for activity.
Families that are thought to have common evolutionary
origins and are known or expected to have similar tertiary
folds are grouped into clans. The MEROPS database provides
sets of files called FamCards and ClanCards describing
the individual families and clans. Each FamCard document
provides links to other databases for sequence motifs
and secondary and tertiary structures, and shows the
distribution of the family across the major taxonomic
GCRDb. GCRDb (Kolakowski, 1994) is a database
of sequences and other data relevant to the biology
of G-protein coupled receptors (GCRs), a very large
protein family of critical components of many different
signalling systems in animals. As can be seen in Figure
9, the information available in a GCRDb entry is
not much more extensive than what you would find in
the EMBL nucleotide sequence entry from which it is
derived. What makes this database useful is not the
entries themselves, but the analyses (e.g. multiple
alignments, classification into subfamilies) which have
been made on the data and which are available from the
GCRDb database. It is a good example of a specialised
database adding value by offering an analytical view
on data which a universal sequence database is unable
YPD. YPD (Hodges et al., 1997) is a database
for the proteins of S. cerevisiae. Based on the
detailed curation of the scientific literature for the
yeast Saccharomyces cerevisiae, YPD contains more than
50,000 annotation lines derived from the review of
8,500 research publications. The information concerning
each of the more than 6,000 yeast proteins is structured
around a one-page format, the Yeast Protein Report,
with additional information provided as pop-up windows.
Protein classification schemas define each protein's
cellular role, function and pathway. YPD provides the
user with a succinct summary of the protein's function
and its place in the biology of the cell. The first
transcript profiling data has been integrated into the
YPD Protein Reports, providing the framework for the
presentation of genome-wide functional data. Altogether
YPD is a very useful data collection for all yeast researchers
and especially for those working on the yeast proteome.
The ENZYME database (http://www.expasy.ch/enzyme/)
is an annotated extension of the Enzyme Commission's
publication, linked to SWISS-PROT. There are also databases
of enzyme properties – BRENDA, Ligand Chemical Database
for Enzyme Reactions (LIGAND http://www.genome.ad.jp/dbget/ligand.html),
and the Database of Enzymes and Metabolic Pathways (EMP).
BRENDA, LIGAND and EMP are searchable via SRS at the
LIGAND is linked to the metabolic pathways in KEGG (http://www.genome.ad.jp/kegg/kegg.html).
Databases of two-dimensional gel electrophoresis
data are available from ExPASy (http://www.expasy.ch/ch2d/)
and the Danish Centre for Human Genome Research (http://biobase.dk/cgi-bin/celis/).
A useful resource for mass spectrometry protein data,
including protein cleavage products, is maintained at
Rockefeller University (http://prowl.rockefeller.edu/).
Very often the sequence of an unknown protein
is too distantly related to any protein of known structure
to detect its resemblance by overall sequence alignment,
but it can be identified by the occurrence in its sequence
of a particular cluster of residue types which is variously
known as a pattern, motif, signature, or fingerprint.
These motifs arise because of particular requirements
on the structure of specific region(s) of a protein,
which may be important, for example, for their binding
properties or for their enzymatic activity. These requirements
impose very tight constraints on the evolution of those
limited (in size) but important portion(s) of a protein
sequence. A signature modelling such a site must be
as short as possible, should detect all or most of the
sequences it is designed to describe and should not
give too many false positive results. In other words
it must exhibit both high sensitivity and high specificity.
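These two measures can be made concrete with a small calculation. The Python sketch below uses the standard definitions of sensitivity and specificity and adds precision (the fraction of hits that are correct) for comparison; the counts are invented and do not describe any real signature.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of true family members that the signature detects."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of non-members that the signature correctly leaves unmatched."""
    return true_neg / (true_neg + false_pos)


def precision(true_pos, false_pos):
    """Fraction of reported hits that are true family members."""
    return true_pos / (true_pos + false_pos)


# Invented counts: a family of 50 proteins in a database of 80,000 sequences;
# the signature hits 48 members and 4 unrelated proteins.
tp, fn, fp = 48, 2, 4
tn = 80_000 - (tp + fn) - fp

print(f"sensitivity = {sensitivity(tp, fn):.3f}")
print(f"specificity = {specificity(tn, fp):.5f}")
print(f"precision   = {precision(tp, fp):.3f}")
```

Because non-members vastly outnumber members in a sequence database, specificity stays very close to 1 even when a noticeable share of the hits is wrong, so the proportion of false positives among the hits is often the more telling figure.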
There are a few databases available, which use different
methodology and a varying degree of biological information
on the characterised protein families, domains and sites.
PROSITE includes extensive documentation on many protein families,
as defined by sequence domains or motifs.
Other databases in which proteins are grouped, using
various algorithms, by sequence similarity include PRINTS
and SBASE (http://base.icgeb.trieste.it/sbase/).
These secondary protein sequence databases have become
vital tools for identifying distant relationships in
novel sequences and hence for inferring protein function.
During the last decade, these databases have evolved
by using signature-recognition methods to address different
sequence analysis problems, resulting in rather different
and independent databases. To perform a comprehensive
analysis, a user therefore has to know several important
things. For example, what are the resources and where
can they be found? What is the difference between them
in terms of diagnostic performance and family coverage?
What do the different search outputs mean? Is it sufficient
to use just one of the databases, and if so, which one?
Or, given the seeming complexity, won't PSI-BLAST (Altschul
et al., 1997) do just as well?
Diagnostically, the most commonly used secondary protein
databases (PROSITE, PRINTS and Pfam) have different
areas of optimum application owing to the different
strengths and weaknesses of their underlying analysis
methods (regular expressions, profiles, Hidden Markov
Models and fingerprints). For example, regular expressions
are likely to be unreliable in the identification of
members of highly divergent super-families (where profiles
and HMMs excel); fingerprints perform relatively poorly
in the diagnosis of very short motifs (where regular
expressions do well); and profiles and HMMs are less
likely to give specific sub-family diagnoses (where
fingerprints excel).
In terms of family coverage, PROSITE, PRINTS and Pfam
are similar in size but differ in content; each contains
between 1,000 and 1,500 entries, spanning a range of globular
and membrane proteins, modules and mosaics, repeats,
and so on. While all of the resources share a common
interest in protein sequence classification, some focus
on divergent domains (e.g., Pfam), some focus
on functional sites (e.g., PROSITE), and others
focus on families, specialising in hierarchical definitions
from super-family down to sub-family levels in order
to pin-point specific functions (e.g., PRINTS).
A number of sequence cluster databases are also commonly
used in sequence analysis, for example to facilitate
domain identification (e.g., ProDom). Unlike
pattern databases, the clustered resources are derived
automatically from sequence databases, using different
clustering algorithms. This allows them to be relatively
comprehensive, because they do not depend on manual
crafting and validation of family discriminators; but
the biological relevance of clusters can be ambiguous,
and clusters may just be artefacts of particular thresholds.
Given these complexities, analysis strategies should
endeavour to combine a range of databases, as none alone
is sufficient. In concert, however, they can complement
routine sequence database searches by providing more
specific diagnoses than are possible with tools such
as PSI-BLAST. PSI-BLAST highlights generic similarities
by gathering sequences into families using an iterative
profiling technique. However, there are problems with
this approach. For example, if a multi-domain protein
is matched, it may not be clear whether the region matched
is the functional part of the protein, and hence whether
functional annotations can be reliably transferred to
the query; similarly, if a large super-family has been
matched, it may be difficult to make the correct family
or sub-family diagnosis.
Unfortunately, these secondary databases do not share
the same formats and nomenclature, which makes the use
of all of them in an automated way difficult. In response
to this the SWISS-PROT and TrEMBL group at the EBI is
working with the PROSITE, PRINTS, Pfam and ProDom groups
on the integration of these databases into an Integrated
resource of Protein domains and functional sites (InterPro).
InterPro will allow users access to a wider, complementary
range of site and domain recognition methods in a single
PROSITE. The special value of this database
is the extensive documentation on many protein families,
as defined by sequence domains or motifs. PROSITE contains
biologically significant sites and patterns formulated
in such a way that with appropriate computational tools
it can rapidly and reliably identify to which family
of proteins the new sequence belongs.
Release 15 of PROSITE contained motifs for 1034 protein
families and sites. Most of the motifs in PROSITE are
regular expressions, so-called patterns. Around a hundred
of the motifs are so-called extended profiles. Of the
available analysis methods, regular expressions are
the simplest to derive. Conserved motifs within sequence
alignments are reduced into consensus expressions, in
which all but the most significant residue information
is discarded. In terms of their performance in pattern
recognition, regular expressions have certain limitations.
Patterns may themselves encode flexibility, or fuzziness,
but require query sequences to match them exactly. Thus
sequences that differ only slightly from the definition
will be missed. Also, there are a number of protein
families as well as functional or structural domains
that cannot be detected using regular expressions due
to their extreme sequence divergence.
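The exact-match behaviour of patterns is easy to demonstrate by translating a pattern into an ordinary regular expression. The Python sketch below assumes the published PROSITE pattern notation (elements separated by `-`, `x` for any residue, `[..]` for allowed and `{..}` for excluded residues, repeat counts in parentheses, `<` and `>` as terminal anchors); the function name and the test sequences are invented, and rarer notational cases are not handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern string into a Python regular expression."""
    pattern = pattern.rstrip(".")              # patterns end with a period
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):            # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):              # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        repeat = ""
        if element.endswith(")"):              # repeat count such as (4) or (2,4)
            element, count = element[:-1].split("(")
            repeat = "{" + count + "}"
        if element == "x":                     # any residue
            core = "."
        elif element.startswith("{"):          # any residue except those listed
            core = "[^" + element[1:-1] + "]"
        else:                                  # a single residue or a [..] set
            core = element
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


# ATP/GTP-binding site motif A (P-loop), written in PROSITE notation.
p_loop = prosite_to_regex("[AG]-x(4)-G-K-[ST].")
print(p_loop)

hit = re.search(p_loop, "MTGESGAGKSTLV")        # invented sequence
miss = re.search(p_loop, "MTGESGAGRSTLV")       # same sequence with K replaced by R
print(hit.group(), hit.start(), miss)
```

The second sequence differs from the first by a single conservative substitution, yet it is not reported at all: a pattern either matches or it does not, with no notion of a near miss.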
The building of a PROSITE pattern usually starts by
studying review(s) on a group or family of proteins.
Then an alignment of the proteins discussed in that
review and of additional sequences relevant to the subject
under consideration is built. Using such alignments,
particular attention is paid to the residues and regions
thought or proved to be important to the biological
function of that group of proteins. Now a `core' pattern,
a short conserved sequence, is created that is part
of a region known to be important or which includes biologically
significant residue(s). The most recent version of SWISS-PROT
is then scanned with these core pattern(s). If a core
pattern detects all the proteins under consideration
and none (or very few) of the other proteins, the core
pattern is used as the signature. In most cases a core
pattern picks up additional sequences which clearly
do not belong to the group of proteins under consideration.
Iterative series of scans, involving a gradual increase
in the size of the pattern, are then necessary.
There are a number of protein families as well as functional
or structural domains that cannot be detected using
patterns due to their extreme sequence divergence; the
use of techniques based on profiles or weight matrices
allows the detection of such proteins or domains with
sequence divergence. A profile is a table of position-specific
amino acid weights and gap costs. These numbers are
used to calculate a similarity score for any alignment
between a profile and a sequence, or parts of a profile
and a sequence. An alignment with a similarity score
higher than or equal to a given cut-off value constitutes
a motif occurrence. As with regular expressions, there
may be several matches to a profile in one sequence,
but multiple occurrences in the same sequence must be
disjoint (non-overlapping) according to a specific definition
included in the profile. Unlike patterns, profiles are
usually not confined to small regions with high sequence
similarity, but attempt to characterise a protein family
or domain over its entire length. Profiles are supposed
to be more sensitive and more robust than patterns because
they provide discriminatory weights not only for the
residues already found at a given position of a motif
but also for those not yet found. The weights for those
not yet found are extrapolated from the observed amino
acid compositions using empirical knowledge about amino
Since 1994 PROSITE has complemented regular expression entries
by gradually adding profile entries. The profile structure
used in PROSITE is similar to but slightly more general
than the one introduced by Gribskov and co-workers (Gribskov
et al., 1987). Generalised profiles are remarkably
similar to the specific type of Hidden Markov Models
(HMMs) used in Pfam (explained below).
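The scoring principle behind a profile can be sketched in Python. This is a strongly simplified, ungapped version: real PROSITE profiles also carry gap costs and their own rules for disjoint matches, which are omitted here. The weights, the cut-off, the `default` weight for residues never observed at a position and the sequence are all invented.

```python
def scan_profile(profile, sequence, cutoff, default=-2):
    """Return disjoint (start, score) matches of an ungapped profile."""
    width = len(profile)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Sum the position-specific weight of each residue in the window.
        score = sum(column.get(res, default)
                    for column, res in zip(profile, window))
        if score >= cutoff:
            hits.append((start, score))
    # Take the best-scoring hits first and drop any that overlap a kept hit.
    chosen = []
    for start, score in sorted(hits, key=lambda h: (-h[1], h[0])):
        if all(abs(start - kept) >= width for kept, _ in chosen):
            chosen.append((start, score))
    return sorted(chosen)


# One dictionary of residue weights per profile position (invented values).
profile = [
    {"G": 5, "A": 3},
    {"K": 6, "R": 2},
    {"S": 4, "T": 4},
]
matches = scan_profile(profile, "MGKSAARTGKT", cutoff=8)
print(matches)
```

In contrast to a pattern, a window such as A-R-T still scores above the cut-off although it shares only one residue class with the best-scoring windows; its lower score simply reflects the weaker resemblance. Raising the cut-off to 10 would remove it.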
PRINTS. A different approach to pattern recognition,
termed "fingerprinting" is used by PRINTS.
Within a sequence alignment, it is usual to find not
one, but several motifs that characterise the aligned
family. Diagnostically, it makes sense to use many,
or all, of the conserved regions to build a family signature.
In a database search, there is then a greater chance
of identifying a distant relative, whether or not all
parts of the signature are matched. Thus, for example,
a sequence that matches only four of seven motifs may
still be diagnosed as a true match if the motifs are
matched in the correct order in the sequence and the
distances between them are consistent with that expected
of true neighbouring motifs. The ability to tolerate
mismatches, both at the level of residues within individual
motifs, and at the level of motifs within the fingerprint
as a whole, renders fingerprinting a powerful diagnostic
PRINTS provides for each fingerprint extensive documentation
about the characterised protein family, domain, or functional
site. Release 23.1 contained 1159 fingerprints.
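The idea of accepting a partial but correctly ordered match can be sketched in Python. The sketch is a toy: motifs are plain strings matched exactly, whereas the residue-level tolerance mentioned above requires scored motifs, and the function name, the motifs, the sequence and the thresholds `min_motifs` and `max_gap` are all invented.

```python
def fingerprint_match(sequence, motifs, min_motifs, max_gap):
    """Return the longest in-order chain of (motif index, position) hits,
    or [] if fewer than `min_motifs` motifs can be chained."""
    # Collect every occurrence of every motif, sorted by position.
    hits = []
    for index, motif in enumerate(motifs):
        pos = sequence.find(motif)
        while pos != -1:
            hits.append((index, pos))
            pos = sequence.find(motif, pos + 1)
    hits.sort(key=lambda h: h[1])

    best = {}   # hit -> longest valid chain ending at that hit
    for i, (index, pos) in enumerate(hits):
        chain = [(index, pos)]
        for prev_index, prev_pos in hits[:i]:
            prev_end = prev_pos + len(motifs[prev_index])
            # Motifs must keep their order and lie within a plausible distance.
            if prev_index < index and 0 <= pos - prev_end <= max_gap:
                candidate = best[(prev_index, prev_pos)] + [(index, pos)]
                if len(candidate) > len(chain):
                    chain = candidate
        best[(index, pos)] = chain
    longest = max(best.values(), key=len, default=[])
    return longest if len(longest) >= min_motifs else []


motifs = ["GAGK", "DTAG", "NKAD", "SAK"]            # invented four-motif fingerprint
sequence = "MMGAGKLLLLDTAGQQQQQQNRADLLSAKV"         # third motif is altered (NRAD)
result = fingerprint_match(sequence, motifs, min_motifs=3, max_gap=20)
print(result)
```

The sequence is accepted on the strength of three of the four motifs, found in the right order and at acceptable distances. The same motifs in the wrong order would not form a chain and would be rejected.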
Pfam. Another important secondary protein database
is Pfam. Release 4.1 of Pfam contained 1488 entries
The methodology used by Pfam to create protein family
or domain signatures is Hidden Markov Models (HMMs).
HMMs are closely related to profiles, but are based
on probability theory methods. These allow a direct
statistical approach to identifying and scoring matches,
and also to combining information from a multiple alignment
with prior knowledge. One feature that distinguishes
HMMs and profiles from regular expressions and fingerprints
is that the former allow the full extent of a domain
to be identified in a sequence. They are thus particularly
useful when analysing multidomain proteins.
Pfam consists of two parts. Pfam-A is curated and
contains well-characterised protein families with high-quality
alignments, which are maintained by using manually checked
seed alignments and HMMs to find and align all members.
Pfam-B is based on ProDom and clusters and aligns
the remaining protein sequences after removal of Pfam-A
domains. Pfam-A families have stable accession numbers
and form a library of HMMs available for scanning of
protein sequences. The biggest drawback of Pfam is its
lack of biological information (annotation) of the protein
InterPro. In the task of sequence characterisation,
we need more reliable, concerted methods for identifying
protein family traits and for inheriting functional
annotation. This is especially important given our dependence
on automatic methods for assigning functions to the
raw sequence data issuing from genome projects. But
rationalising this process by creating a single coherent
resource for diagnosis and documentation of protein
families is difficult, given entirely different database
formats, different search tools and different search
outputs. InterPro is an attempt to address some of these
issues. This new resource provides an integrated view
of a number of commonly used pattern databases, and
provides an intuitive interface for text- and sequence-based
The first release of InterPro was built from Pfam 4.1
(1,488 domains), PRINTS 23.1 (1,159 fingerprints) and
PROSITE 15 (1,034 families).
Flat-files submitted by each of the groups were systematically
merged and dismantled. Where relevant, family annotations
were amalgamated, and all method-specific annotation
separated out. This process was complicated by the relationships
that can exist, both between entries in the same database,
and between entries in different databases. Different
types of parent-child relationship were evident, leading
to the differentiation into 'sub-types' and 'sub-strings'.
A sub-string means that a motif or motifs are contained
within a region of sequence encoded by a wider pattern
(e.g., a PROSITE pattern is typically contained
within a PRINTS fingerprint; or a fingerprint might
be contained within a Pfam domain). A sub-type means
that one or more motifs are specific for a sub-set of
sequences captured by another more general pattern (e.g.,
a super-family fingerprint may contain several family-
and sub-family-specific fingerprints; or a generic Pfam
domain may include several family fingerprints).
Having classified the parent-child relationships of
overlapping PROSITE, PRINTS and Pfam entries, all recognisably
distinct entities were assigned unique accession numbers
(which take the form IPR000000). In doing this, the general
principle was adopted that parents and children with
sub-string relationships usually have the same IPR numbers,
while sub-type parent-child relationships warrant their
To facilitate in-house maintenance, InterPro is managed
within a relational database system. For users, however,
the core InterPro entries are released in a single ASCII
(text) flatfile, which is written in XML. The overall
data flow, from individual data provider, through the
DBMS, out to the flatfile and on to the user, is fairly
complex – a flavour of this complexity is given in Figure
Release 1.0 (November 1999) contains nearly 2,300 entries,
representing families, domains, repeats and PTMs encoded
by 4,300 different regular expressions, profiles, fingerprints
and HMMs. Overall, InterPro entries have more than 370,000
hits against sequences in SWISS-PROT and TrEMBL.
InterPro is accessible for interactive use via the
EBI Web server, which can also be reached via each of
the member databases. The Web interface allows text-based
searches using SRS (Etzold et al., 1996) and
sequence-based searches using software provided by the
consortium members. Interpretation of output is facilitated
by means of a graphical user interface, which has been
extended from the tools used to visualise ProDom families.
Thus, for each sequence, the domain and/or motif organisation
can be seen at a glance.
The flatfile distribution may be retrieved from the
EBI anonymous-ftp server (ftp://ftp.ebi.ac.uk/pub/databases/interpro).
While the initial InterPro release was built around
PRINTS, PROSITE and Pfam, ProDom will shortly also be
included. Various factors rendered a step-wise approach
to the development of InterPro desirable. First, the
scale of the task of amalgamating just the first three
databases was immense. The rational merging of apparently
equivalent database entries that in fact simultaneously
define a specific family, domains within that family,
or even repeats within those domains, presented an enormous
challenge. Thus, the immediate goal for InterPro was
to limit the problem only to databases that offered
annotation. A second important consideration was that
while Pfam, PRINTS and PROSITE are true pattern databases,
ProDom is based solely on automatic clustering of sequences
by similarity (i.e., discriminators are not derived).
Resulting clusters need not have precise biological
correlations and some family designations have changed
between database versions. It was therefore necessary
that ProDom should adopt stable accession numbers before
its entries could be meaningfully considered for inclusion
in InterPro. The full integration of ProDom into InterPro
will be achieved in release 2 (May 2000).
Once the founder members of the InterPro consortium
have been assimilated into the unified resource, other
pattern databases will also be included. First, scheduled
for release 3 (November 2000), will be the SMART resource
(Schultz et al., 1998). In addition, the Blocks
database (Henikoff et al., 1999) is planning
to use InterPro as the basis for the creation of Blocks.
As Blocks does not include annotation and will be based
on families already in InterPro, the process of cross-referencing
between Blocks and InterPro, and even the full integration
of Blocks within InterPro, should be relatively straightforward.
Ultimately, InterPro will include many other protein
family databases to give a more comprehensive view of
the resources available.
A primary application of InterPro's family, domain
and functional site definitions will be in the computational
functional classification of newly determined sequences
that lack biochemical characterisation. For instance,
the EBI will use InterPro for enhancing the automated
annotation of TrEMBL. This process should be more efficient
and reliable than using each of the pattern databases
separately, because InterPro will provide internal consistency
checks and deeper coverage. This has already been outlined
in detail earlier in this article.
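The idea of an internal consistency check can be sketched as follows: hits from several member-database signatures are grouped by the integrated entry they belong to, and an assignment is treated as better supported when more than one independent method agrees. This is a hedged illustration under stated assumptions, not the procedure used at the EBI: the mapping `SIGNATURE_TO_ENTRY`, all identifiers in it, the `min_support` threshold and the labels are hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping from member-database signatures to integrated
# entries; every identifier here is invented for illustration.
SIGNATURE_TO_ENTRY = {
    "pattern_A": "entry_1",      # e.g. a regular expression
    "hmm_A": "entry_1",          # e.g. an HMM for the same family
    "fingerprint_A": "entry_1",  # e.g. a fingerprint for the same family
    "profile_B": "entry_2",
}


def group_hits(hits, mapping=SIGNATURE_TO_ENTRY):
    """Group (sequence_id, signature) hits by sequence and integrated entry."""
    grouped = defaultdict(lambda: defaultdict(set))
    for sequence_id, signature in hits:
        entry = mapping.get(signature)
        if entry is not None:  # ignore signatures that are not integrated
            grouped[sequence_id][entry].add(signature)
    return grouped


def assess(hits, min_support=2, mapping=SIGNATURE_TO_ENTRY):
    """Label each (sequence, entry) pair by how many signatures agree."""
    report = {}
    for sequence_id, entries in group_hits(hits, mapping).items():
        for entry, signatures in entries.items():
            if len(signatures) >= min_support:
                label = "supported"
            else:
                label = "single-method"
            report[(sequence_id, entry)] = (label, sorted(signatures))
    return report


EXAMPLE_HITS = [
    ("seq1", "pattern_A"),
    ("seq1", "hmm_A"),
    ("seq2", "pattern_A"),
    ("seq2", "profile_B"),
]

if __name__ == "__main__":
    for key, value in sorted(assess(EXAMPLE_HITS).items()):
        print(key, value)
```

In this toy example the assignment of seq1 to entry_1 is backed by two methods, whereas both assignments for seq2 rest on a single signature and would be candidates for closer inspection before automatic annotation.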
Another major use of InterPro will be in identifying
those families and domains for which the existing discriminators
are not optimal and could hence be usefully supplemented
with an alternative pattern (e.g., where a regular
expression identifies large numbers of false matches
it could be useful to develop an HMM, or where a Pfam
entry covers a vast super-family it could be beneficial
to develop discrete family fingerprints, and so on). Alternatively,
InterPro is likely to highlight key areas where none
of the databases has yet made a contribution and hence
where the development of some sort of pattern might
The number of known protein structures is increasing
very rapidly and these are available through the Protein
Data Bank (PDB, http://www.rcsb.org/pdb/).
The Nucleic Acid Database (NDB, http://ndbserver.rutgers.edu/)
is the database for structural information about nucleic
acid molecules. There is also a database of structures
of 'small' molecules, of interest to biologists concerned
with protein-ligand interactions, from the Cambridge
Crystallographic Data Centre (http://www.ccdc.cam.ac.uk/).
Abola, E.E., Manning, N.O., Prilusky, J., Stampf,
D.R., and Sussman, J.L. (1996). J. Res. Natl. Inst. Stand.
Technol. 101, 231-241.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang,
J., Zhang, Z., Miller, W., and Lipman, D.J. (1997). Nucl.
Acids Res. 25, 3389-3402.
Apweiler, R., Gateau, A., Contrino, S., Martin, M.J.,
Junker, V., O'Donovan, C., Lang, F., Mitaritonna, N.,
Kappus, S., and Bairoch, A. (1997). In: "Proceedings
of the Fifth International Conference on Intelligent
Systems for Molecular Biology (ISMB)" (T. Gaasterland,
P. Karp, K. Karplus, C. Ouzounis, C. Sander, and A.
Valencia, eds.), pp. 33-43. AAAI Press, Menlo Park.
Attwood, T. K., Flower, D. R., Lewis, A. P., Mabey,
J. E., Morgan, S. R., Scordis, P., Selley, J. N., and
Wright, W. (1999). Nucl.
Acids Res. 27, 220-225.
Bairoch, A. (1996). Nucl.
Acids Res. 24, 221-222.
Bairoch, A., and Apweiler, R. (1999). Nucl.
Acids Res. 27, 49-54.
Barker, W.C., Garavelli, J.S., McGarvey, P.B., Marzec,
C.R., Orcutt, B.C., Srinivasarao, G.Y., Yeh, L.-S.L.,
Ledley, R.S., Mewes, H.-W., Pfeiffer, F., and Tsugita,
A. (1999). Nucl.
Acids Res. 27, 39-43.
Bateman, A., Birney, E., Durbin, R., Eddy, S.R., Finn,
R.D., and Sonnhammer, E.L.L. (1999). Nucl.
Acids Res. 27, 260-262.
Blake, J.A., Richardson, J.E., Davisson, M.T., and
Eppig, J.T. (1999). Nucl.
Acids Res. 27, 95-98.
Bleasby, A., Akrigg, D., and Attwood, T.K. (1994). Nucl.
Acids Res. 22, 3574-3577.
Bork, P., and Koonin, E.V. (1998). Nature Genet. 18,
Corpet, F., Gouzy, J., and Kahn, D. (1999). Nucl.
Acids Res. 27, 263-267.
Dayhoff, M.O., Eck, R.V., Chang, M.A., and Sochard,
M.R. (1965). Atlas of Protein Sequence and Structure
Vol. 1. National Biomedical Research Foundation, Silver
Dayhoff, M.O. (1979). Atlas of Protein Sequence and
Structure Vol. 5, Supplement 3. National Biomedical
Research Foundation, Washington, DC.
Etzold, T., Ulyanov, A., and Argos, P. (1996). Methods
Fleischmann, W., Möller, S., Gateau, A., and Apweiler,
R. (1999). Bioinformatics
FlyBase Consortium (1999). Nucl.
Acids Res. 27, 85-88.
Frishman, D., and Mewes, H.-W. (1997). Trends in Genetics
Glemet, E., and Codani, J.-J. (1997). Comp. Appl. Bio.
Sci. 13, 137-143.
Gribskov, M., McLachlan, A.D., and Eisenberg, D. (1987).
Proc. Natl. Acad. Sci. U.S.A. 84, 4355-4358.
Henikoff, S., Henikoff, J.G., and Pietrokovski, S.
Hodges, P.E., McKee, A.H.Z., Davis, B.P., Payne, W.E.,
and Garrels, J.I. (1999). Nucl.
Acids Res. 27, 69-73.
Hofmann, K., Bucher, P., Falquet, L., and Bairoch,
A. (1999). Nucl.
Acids Res. 27, 215-219.
Kolakowski, L.F. Jr. (1994). Receptors Channels 2,
Möller, S., Leser, U., Fleischmann, W., and Apweiler,
R. (1999). Bioinformatics
Nevill-Manning, C.G., Sethi, K.S., Wu, T.D., and Brutlag,
D.L. (1997). In: "Proceedings of the Fifth International
Conference on Intelligent Systems for Molecular Biology
(ISMB)" (T. Gaasterland, P. Karp, K. Karplus, C.
Ouzounis, C. Sander, and A. Valencia, eds.), pp. 202-209.
AAAI Press, Menlo Park.
O'Donovan, C., Martin, M.J., Glemet, E., Codani, J.-J.,
and Apweiler, R. (1999). Bioinformatics
Rawlings, N.D., and Barrett, A.J. (1999). Nucl.
Acids Res. 27, 325-331.
Scharf, M., Schneider, R., Casari, G., Bork, P., Valencia,
A., Ouzounis, C., and Sander, C. (1994). In: "Proceedings
of the Second International Conference on Intelligent
Systems for Molecular Biology (ISMB)" (R. Altman,
D. Brutlag, P. Karp, R. Lathrop, and D. Searls, eds.), pp.
348-353. AAAI Press, Menlo Park.
Schultz, J., Milpetz, F., Bork, P., and Ponting, C.P.
(1998). Proc. Natl. Acad. Sci. U.S.A. 95, 5857-5864.
Stoesser, G., Tuli, M.A., Lopez, R., and Sterk, P. (1999). Nucl.
Acids Res. 27, 18-24.