About the special features of MEROPS
Among the special features that are to be found in MEROPS are a batch BLAST service, individual BLAST searches, domain images, EST alignments, evolutionary trees, literature pages, molecular images, sequence alignments and substrate specificity data. The present section is intended to give a little background to each of these.
The user may submit a file of up to 5,000 protein sequences in FastA format to the MEROPS Batch Blast service. Each sequence will be compared against a special library by use of BLASTP (Altschul et al., 1990). The special library contains the sequences of all the peptidase and inhibitor holotypes as well as all the sequences that act as linkers for transitive relationships in the families. The user provides the name of the file to be uploaded and an e-mail address for the results. When the analysis is complete, usually within the same day, a report will be sent. This will list only peptidase and inhibitor units detected in the input file (an E value of e-04 or less being considered significant). For each significant match the following are shown: the submitted sequence identifier, the MEROPS family name, the range of the peptidase or inhibitor unit, active site residues, ligands for catalytic metal ions, the MERNUM of the closest homologue and the E-value for the match. Active site residues and metal ligands are not predicted for inhibitors. For example:
YP_393134.1 M1 215-434 E299, Y382 H298, H302, E321 MER24282 8.00e-20
An active site residue or metal ligand is shown in single letter code followed by the residue number. Where a residue does not match any of those permitted for the active site residue or metal ligand at that position, the permitted amino acids are shown following the residue number. For example:
YP_394542.1 S16 16-127 Q78S, L120K/R MER53030 4.00e-14
Here, both active site residues have been replaced; the active serine with glutamine and the active site lysine or arginine with leucine. Although this sequence is that of a homologue of family S16 it would be predicted not to be a peptidase.
A hyphen preceding a residue number means that the active site residue or metal ligand in the submitted sequence is missing. An angled bracket preceding an active site residue or metal ligand indicates that the submitted sequence contains only a fragment of the peptidase unit, for example:
YP_390722 S33 39-200 S122, >D, >H MER41096 7.00e-05
Here, the submitted sequence contained only an N-terminal fragment of the peptidase unit, and only the active site serine was found.
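The notation used for active site residues and metal ligands in the report can be read mechanically. The following is a minimal Python sketch, not part of MEROPS itself, that interprets a single annotation such as E299, L120K/R or >D. The class and function names are hypothetical, and the form assumed for a missing residue (a hyphen followed directly by the residue number) is a guess based on the description above.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SiteAnnotation:
    status: str                  # "match", "replaced", "missing" or "outside_fragment"
    observed: Optional[str]      # residue found in the submitted sequence, if any
    position: Optional[int]      # residue number in the submitted sequence, if any
    permitted: Tuple[str, ...]   # permitted residues, when the report lists them


# Observed residue, residue number, optional permitted residues (e.g. L120K/R).
_RESIDUE = re.compile(r"^([A-Z])(\d+)([A-Z](?:/[A-Z])*)?$")
# Assumed form for a missing residue: hyphen, residue number, optional permitted residues.
_MISSING = re.compile(r"^-(\d+)([A-Z](?:/[A-Z])*)?$")


def _split(permitted):
    """Turn 'K/R' into ('K', 'R'); an absent or empty list becomes ()."""
    return tuple(permitted.split("/")) if permitted else ()


def parse_site(token):
    """Interpret one active site or metal ligand annotation from a report line."""
    token = token.strip().rstrip(",")
    if token.startswith(">"):
        # The residue lies outside the fragment that was submitted.
        return SiteAnnotation("outside_fragment", None, None, _split(token[1:]))
    missing = _MISSING.match(token)
    if missing:
        return SiteAnnotation("missing", None, int(missing.group(1)),
                              _split(missing.group(2)))
    found = _RESIDUE.match(token)
    if not found:
        raise ValueError(f"unrecognised annotation: {token!r}")
    observed, number, permitted = found.groups()
    # Permitted residues are listed only when the observed residue is not one of them.
    status = "replaced" if permitted else "match"
    return SiteAnnotation(status, observed, int(number), _split(permitted))
```

With this sketch, parse_site("L120K/R") reports a leucine at position 120 where lysine or arginine would be required, which corresponds to the S16 example above.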
The Blast search in MEROPS is implemented as follows. A library has been compiled containing amino acid sequences of peptidase units from our entire collection for peptidases and peptidase homologues. A second library contains inhibitor units from our collection of protein inhibitor sequences. The searches available are BLASTP (protein sequence query against a protein sequence database) and TBLASTX (nucleic acid sequence query against a protein sequence database). The BLAST programs were described by Altschul et al., 1990.
Cleavage data upload
The user may submit a file of up to 5,000 protein cleavage sites in tab-delimited format. Data in the file must be in the following format:
|MEROPS identifier||protein identifier||sequence N-terminal of cleavage||sequence C-terminal of cleavage|
There are some special MEROPS identifiers for cleavages where the peptidase has not been fully characterized. The identifier CLE_UNK indicates that the peptidase is unknown; an identifier such as CLE_C14 indicates that the cleavage is by a member of the caspase (C14) family, but the actual peptidase is not known. For a cleavage by a member of the chymotrypsin family (S1), the identifier would be CLE_S01.
The protein identifier can be either a UniProt accession (e.g. P02453) or an EMBL/GenBank ProtID (e.g. AAA35526).
The total length of sequence uploaded must be at least eight residues and should be enough for an automated search to find the exact cleavage within the database entry. If the protein substrate sequence contains repeating elements, then it may be necessary to provide more than eight residues. For an exopeptidase cleavage, it will be necessary to provide more than four residues on one side of the cleavage site. For example, for an aminopeptidase cleavage there will be only one residue on the non-prime side of the cleavage, so please provide at least seven residues on the prime side.
You are also requested to provide an E-mail contact, a reference and/or a PUBMED identifier. The submitted cleavages will be mapped to UniProt entries and the position of the cleavage site calculated. Mapped cleavages will appear in the next release of the MEROPS database.
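As an illustration of the upload format described above, the following Python sketch builds one tab-delimited row and checks the eight-residue minimum. It is only a sketch under stated assumptions: the function names are hypothetical, the zero-padding rule for family-level identifiers is inferred from the CLE_C14 and CLE_S01 examples, and the sequences in the usage note are invented for illustration.

```python
import re


def family_level_identifier(family):
    """Build a CLE_ identifier for a cleavage assigned only to a family.

    Inferred from the examples in the text: 'S1' gives 'CLE_S01' and
    'C14' gives 'CLE_C14'. Subfamily suffixes are not handled.
    """
    match = re.fullmatch(r"([A-Z])(\d{1,2})", family)
    if not match:
        raise ValueError(f"unexpected family name: {family!r}")
    letter, number = match.groups()
    return f"CLE_{letter}{int(number):02d}"


def format_cleavage_row(merops_id, protein_id, n_side, c_side, min_total=8):
    """Return one tab-delimited row: identifier, protein, N-side, C-side."""
    n_side = n_side.strip().upper()
    c_side = c_side.strip().upper()
    if len(n_side) + len(c_side) < min_total:
        raise ValueError("at least eight residues are needed around the cleavage")
    return "\t".join([merops_id, protein_id, n_side, c_side])
```

For example, format_cleavage_row(family_level_identifier("C14"), "P02453", "DEVD", "GSAT") yields a row for a cleavage by an unidentified member of family C14; the residues here are placeholders, not a real cleavage site.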
These Search pages allow the user to ask a question like "What peptidases are in the mouse genome but not in the human?". When the sequencing and analysis of the two genomes have been properly completed, it is in principle a simple matter to make this sort of comparison. Having selected two species from the drop-down menus the user can choose to compare the peptidases in the genomes at the level of Family or MEROPS Identifier. With a comparison at the Family level, the number of distinct amino acid sequences that MEROPS is aware of for each family is shown, divided into "known or putative peptidases" and "non-peptidase homologues". Any cell showing differences that appear to us to be of particular interest is highlighted in pink. The comparison by MEROPS Identifier works in a similar way, but "unassigned" refers to the set of sequences that are apparently those of active peptidases, but cannot yet be assigned to specific identifiers, and "homologues" is used for the sequences that lack catalytic residues.
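The comparison at the Family level amounts to a set difference on per-family sequence counts. A minimal Python sketch follows; the function name and the counts in the usage note are hypothetical and are not taken from MEROPS.

```python
def families_only_in(first, second):
    """List families with at least one sequence in `first` and none in `second`.

    Each argument maps a MEROPS family name to the number of distinct
    sequences known for that family in one genome.
    """
    return sorted(
        family
        for family, count in first.items()
        if count > 0 and second.get(family, 0) == 0
    )
```

Given invented counts such as mouse = {"A1": 9, "S1": 120, "C2": 14} and human = {"A1": 5, "S1": 110}, families_only_in(mouse, human) returns ["C2"].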
The Summary page for each peptidase or inhibitor shows a "domain image". This is a representation of the structure of the complete molecule in which the recognised protein domains are shown as "beads on a string". The length of the string is proportional to the number of residues in the protein. A peptidase unit is shown in green and an inhibitor unit in turquoise; other colours are used for other domains, but domains of the same type are always shown in the same colour. Domains are shown as ovals, or squares when the number of residues is less than fifty. Residue numbers for the domains are visible in mouse-over text. Disulfide bridges are shown on the top edge of the diagram in broken lines. Active site residues shown on the bottom edge of the diagram are catalytic residues (red diamond on a stalk) and ligands for a catalytic metal ion (blue diamond on a stalk).
We search an EST database with query sequences for all known members of a peptidase family at one time. The procedure can be summarized as follows:
- Assemble a library of full-length amino acid query sequences for the peptidase family that contains amino acid sequences for all known examples from the species of interest together with selected other mammalian sequences.
- With TFASTX (Pearson et al., 1997), search a library containing all available ESTs for the given species with each of the query sequences.
- Merge information from all the result files, and re-format output so as to show for each query sequence the set of ESTs that match this better than any other.
- Divide the set of ESTs matching each query sequence into (a) ESTs that match to better than 95% identity, and so probably are the products of the known gene, and (b) those that match less well, and may be the products of novel homologous genes. (This step is done only when the query sequence was from the same species.)
- Assemble a sequence alignment for each subset and format this in HTML, marking features including extent of the peptidase unit and known catalytic residues.
- Collect data for each EST regarding IMAGE clone number, description of library from which the EST was derived, and Unigene cluster to which the EST is assigned, if any. Tabulate these data.
Special characters used by the TFASTX program remain in the alignments; these include symbols for frame-shifts ('\', '/'), 'x' for a stop-codon, and 'X' for an untranslatable codon.
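The merging and dividing steps of the procedure above can be sketched in Python as follows. This is an illustration only: the tuple layout, the use of a numeric score to decide which query an EST matches best, and the function name are assumptions, since the actual output handling is not described here.

```python
from collections import defaultdict


def partition_ests(hits, threshold=95.0):
    """Assign each EST to its best query, then split by percentage identity.

    `hits` is an iterable of (est_id, query_id, percent_identity, score).
    Returns {query_id: {"known_gene": [...], "possible_novel": [...]}}.
    """
    best = {}
    for est_id, query_id, identity, score in hits:
        # Keep only the highest-scoring query for each EST.
        if est_id not in best or score > best[est_id][2]:
            best[est_id] = (query_id, identity, score)

    result = defaultdict(lambda: {"known_gene": [], "possible_novel": []})
    for est_id, (query_id, identity, _score) in sorted(best.items()):
        # "Better than 95% identity" is read here as strictly greater than 95.
        subset = "known_gene" if identity > threshold else "possible_novel"
        result[query_id][subset].append(est_id)
    return dict(result)
```

As in the procedure, the split into the two subsets is meaningful only when the query sequence comes from the same species as the ESTs.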
MEROPS presents evolutionary trees for families and subfamilies and also for individual peptidases and inhibitors. For a family or subfamily, a difference matrix is calculated from the sequence alignment, and the percentage difference values are converted to Accepted Point Mutations (PAMs) according to the tables in Dayhoff et al., 1978. These data are then used as input for the QuickTree program (Howe et al., 2002) which generates evolutionary trees by use of the UPGMA algorithm. It is important to note that the relationships depicted are those between the peptidase units and take no account of the composition of the non-peptidase domains that are frequently attached to the peptidase units. A tree is displayed so that the most divergent sequences are towards the bottom of the page. We have excluded from alignments and trees all fragments of sequences shorter than half the length of the type example, and sequences that are not present in the SWISS-PROT, TREMBL, TREMBLNEW or PIR databases.
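The first step, calculating a difference matrix from an alignment, can be illustrated with a short Python sketch. The treatment of gaps (columns with a gap in either sequence are skipped) is an assumption, as are the function names. The conversion to PAMs from the Dayhoff tables and the tree building in QuickTree are not reproduced here.

```python
def percent_difference(seq_a, seq_b, gap="-"):
    """Percentage of aligned, ungapped positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    differing = 0
    for res_a, res_b in zip(seq_a, seq_b):
        if res_a == gap or res_b == gap:
            continue  # assumption: gapped columns do not contribute
        compared += 1
        if res_a != res_b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped positions in common")
    return 100.0 * differing / compared


def difference_matrix(alignment):
    """Return {(name_a, name_b): percentage difference} for every pair."""
    names = list(alignment)
    return {
        (a, b): percent_difference(alignment[a], alignment[b])
        for a in names
        for b in names
    }
```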
The sequences in each alignment and tree are numbered according to the key shown in the lower frame of the page. The key contains the name of the peptidase, the subtype, the organism, the SWISS-PROT or TREMBL accession number, and the MEROPS identifier. The type example for the family or subfamily is highlighted in green.
The method used for the construction of trees for the set of species variants of an individual peptidase or inhibitor is described under Sequence alignments.
There is a page for each organism from which a peptidase or protein inhibitor homologue has been included in MEROPS. If the genome of that organism has been completely sequenced then there will be a table at the bottom of the page to show unusual features relating to peptidases or protein inhibitor genes. The genome is compared with those of its closest relatives (working up the taxonomic tree until there are at least three relatives in the same taxon with completely sequenced genomes), and comparison is made at the MEROPS family level. The following are noted.
- significant presence: present in the current genome but absent in 90% of closest related genomes.
- significant absence: absent in the current genome but present in 90% of closest related genomes.
- lineage specific expansion: the current genome has more paralogues than any of the other closely related genomes.
- lineage specific reduction: the current genome has fewer paralogues than any of the other closely related genomes.
The user should be aware that, for a prokaryote, no account is taken of whether a gene is on the chromosome or a plasmid. Data in MEROPS are collected at the species level, so if the genome has been determined from more than one strain the number of paralogues may be inflated because of genes present on strain-specific plasmids, or because the protein sequences from different strains are more divergent than expected (less than 95% identity) and each is considered a distinct entity.
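The four criteria listed above can be expressed as a small Python function. This is a sketch under assumptions rather than the MEROPS implementation: "90%" is read as "at least 90%", the criteria are applied independently so that more than one label can be returned for a family, and the function name is hypothetical.

```python
def classify_family(current, relatives, fraction=0.9):
    """Label one family given paralogue counts in the current and related genomes.

    `current` is the count in the genome of interest; `relatives` holds the
    counts in its closest completely sequenced relatives (at least three).
    """
    if len(relatives) < 3:
        raise ValueError("at least three related genomes are required")
    total = len(relatives)
    absent = sum(1 for count in relatives if count == 0)
    present = total - absent

    labels = []
    if current > 0 and absent >= fraction * total:
        labels.append("significant presence")
    if current == 0 and present >= fraction * total:
        labels.append("significant absence")
    if current > max(relatives):
        labels.append("lineage specific expansion")
    if current < min(relatives):
        labels.append("lineage specific reduction")
    return labels
```

For example, classify_family(5, [2, 3, 4]) returns ["lineage specific expansion"], whereas classify_family(3, [2, 3, 4]) returns an empty list.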
The Literature pages contain citations of articles selected by the MEROPS team during their weekly searches. In total, the pages contain well over 30,000 references, and many of them include links to fuller information available through PubMed at NCBI, or by use of the digital object identifier, DOI, to the full article at the publishers' website.
We have marked some of the publications that are relevant to particularly important topics by use of coloured 'flags'. While these can be only a very rough guide, they may provide starting points for reading. The user can click on the link to one of these topics at the top of a Literature page to select only the relevant references. The topics we have flagged can be defined as follows:
- A Assay method,
- E recombinant Expression,
- I design of small-molecule Inhibitors,
- K gene Knockout or other artificial genetic manipulation,
- M natural Mutation, allelic variant or polymorphism,
- P Substrate specificity,
- R RNA splicing variation,
- S three-dimensional Structure,
- T proposed as a therapeutic Target,
- U suggested to have therapeutic potential itself,
- V review.
At least one ribbon diagram is included in the database for each peptidase family for which a three-dimensional structure has been published. Each of these (in the style of Richardson, 1985) has been constructed from a PDB entry, and where possible structures have been selected that include a bound small molecule inhibitor or substrate. We show alpha helices as red coils, beta sheets as green arrows, and coils and turns as cyan wires. Active-site residues and bound inhibitors and substrates are shown in ball-and-stick representation, and metal ions as Corey-Pauling-Koltun spheres. In some cases, the PDB entry has been edited so that only one subunit of a multimeric structure is shown. The diagrams were constructed by use of a series of programs. First, the RASMOL program (Sayle & Milner-White, 1995) was used to orient the molecule so that all the active-site residues were visible, to place molecules with similar structures in the same orientation and to calculate secondary structure according to Kabsch & Sander (Kabsch & Sander, 1983). The MOLSCRIPT program (Kraulis, 1991) was then used to generate an input file for the RENDER program (Bacon & Anderson, 1988) of the RASTER3D package (Merritt & Murphy, 1994).
A dynamically-generated, rotating Richardson image is also displayed. The image is displayed using the Astex viewer (Hartshorn, 2002) which is written in Java; to view these images you may have to install Java. The structural elements displayed in the static image are also displayed in the rotating image. There is a button underneath the image which when clicked will toggle a semi-transparent surface. Within the viewer, by holding down the left-hand button of the mouse it is possible to manually rotate the image. By clicking the right-hand mouse button the user has full access to all the Astex viewer commands. Please consult the Astex viewer manual for instructions on how to use the viewer.
An alignment at the family or subfamily level shows the parts of the sequences that overlap the peptidase unit of the type peptidase. Each alignment has been generated by use of the MAFFT (Katoh et al., 2002) or Clustal W (Higgins et al., 1996) programs, and edited to remove any sequences in which active site residues (or metal ligands for metallopeptidases) are misaligned. The alignment is numbered according to the preproprotein of the type example. Gaps are indicated by hyphens, and where these correspond to an insert relative to the type example, the numbering reverts to a lettered sequence. The type example for the family or subfamily is highlighted in green. The alignment is annotated by colour highlighting according to known or hypothetical features of the type example. Annotations include locations of active site residues, metal ligands, carbohydrate attachment sites, disulfide bridges and the extent of transmembrane domains. A key to the symbols used can be found at the bottom of each alignment.
The alignments for individual peptidases and inhibitors, and for holotypes at the family level, are produced dynamically by use of the MUSCLE multiple alignment program (Edgar, 2004). The dendrograms for the peptidases and inhibitors are produced from the MUSCLE alignment files by use of Sean Eddy's sreformat program and the QuickTree algorithm (Howe et al., 2002). The final tree is displayed by use of the ClustalTree applet written by Rodrigo Lopez and Stephen Robinson at the European Bioinformatics Institute.
Substrate specificity data
A vitally important property of any peptidase is its substrate specificity. It is at present not possible to predict peptidase specificity computationally for the vast majority of peptidases. This is because the substrates of peptidases have enormously complex structures. At the level of primary structure there are 20 amino acids (not just 4 bases, for example) together with many possible modifications of these. And then there are the levels of secondary structure and tertiary structure that have profound, but poorly understood, effects on the susceptibility of a polypeptide to the action of a given peptidase. So, the best we can do is to collect experimental data and present this as clearly as possible in the hope that it will give some indication of specificity as a basis for direct experimental testing. The data usually have to be collected from paper publications, which is a very time-consuming business, but we do as much as we can. Of course we want to give credit to the original source of each item of specificity data, so there is either a literature reference or a link to UniProt.
The specificity data are presented in MEROPS in five main ways. There are four queries that are available through the Searches page that respond to the questions:
- What peptidase can cleave this bond?
- How may this substrate be cleaved?
- What are the known cleavage sites in this protein?
- What cleavages does this peptidase make?
In addition to these, there is a 'Substrates' button on the summary page for each peptidase for which we have data, and this links the user to a page of known substrate cleavages for the peptidase in question. Each page lists substrates alphabetically and shows up to four residues either side of the scissile bond. The information may help in the design of test substrates and inhibitors, and in distinguishing the peptidase from others. Substrate cleavages are annotated as "physiological", "non-physiological" and "pathological" and it is possible to filter the list to show only one of these sets.
The query "What are the known cleavage sites in this protein?" allows the user to enter the UniProt (SwissProt or Trembl) accession for a protein and the display shows the protein sequence with known cleavage sites indicated by scissile bond symbols. By placing the mouse over a scissile bond symbol the user is presented with a menu of peptidases (with links to the relevant summaries) known to perform these cleavages. For non-physiological cleavages, the peptidase name is shown in italics. At the bottom of the page the user is presented with an option to align this substrate sequence with its close homologues (those that are in the same UniRef50 database entry, i.e. sequences that are more than 49% identical). On clicking this option, a MUSCLE alignment will be generated and the protein sequence of the substrate initially selected is highlighted with a green background. Above the alignment there is a row for each peptidase known to cleave the selected substrate where a scissile bond symbol is placed over the residue occupying the P1 position in the cleavage. Where the substrate is a portion of the complete protein sequence this is indicated by left and right pointing arrows showing the sequence range. Conservation of residues around the cleavage site is indicated for the other homologues, where identical residues are highlighted in pink, replacements known to be acceptable for the peptidase in question are highlighted in orange, and replacements not known to be acceptable to the peptidase are shown as white on a black background.
If there are no known cleavages for the selected substrate, or the known cleavages cannot be mapped to a specific peptidase, but cleavages are known for one or more close homologues (again one in the same UniRef50 database entry) then the user is presented with an option to display cleavages for each of these.
Specificity logos and matrices
The summary page for any peptidase with well-characterised specificity contains a 'logo' that is a diagrammatic representation of the specificity preference in each of the subsites P4-P4'. To generate the logos, sequences around cleavage sites (ten or more) are aligned, and a hidden Markov model is generated. This is converted to a logo by use of the WebLogo package (Crooks et al., 2004). The preferred amino acid residues in each site are shown in the usual one-letter code, and colour-coded according to chemical type: green for polar amino acids (Gly, Ser, Thr, Tyr, Cys, Gln, Asn), red for acidic amino acids (Asp, Glu), blue for basic amino acids (Arg, Lys, His) and black for hydrophobic amino acids (Ala, Val, Leu, Ile, Pro, Trp, Phe, Met). Each cleavage pattern is also represented in plain text, with oblique strokes indicating the positions P4-P4'. Larger and smaller, upper- or lower-case letters are used to give an intuitive feel for the strength of the specificity preference.
The specificity matrix shows how frequently each amino acid has been found to occur in each position around the scissile bond. There is a column for each substrate residue P4 to P4', and a row for each amino acid (ordered by side chain characteristics: aliphatic hydrophobic, aromatic hydrophobic, uncharged, acidic and basic). The number of occurrences of each residue is shown in the cell, and the background of the cell is shaded more darkly when the number is large. This display has the advantage that amino acids not known to occur in a given cleavage site can be seen.
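The counting behind such a matrix can be sketched in a few lines of Python. This is an illustration, not the MEROPS code: the function name, the use of a hyphen for an unoccupied position, and the example cleavages in the usage note are assumptions.

```python
from collections import Counter

POSITIONS = ("P4", "P3", "P2", "P1", "P1'", "P2'", "P3'", "P4'")


def specificity_matrix(cleavages):
    """Count how often each residue occurs at each position P4 to P4'.

    `cleavages` is an iterable of (non_prime_side, prime_side) strings, each
    holding up to four residues next to the scissile bond.
    """
    counts = {position: Counter() for position in POSITIONS}
    for non_prime, prime in cleavages:
        # Pad short sides so that P1 and P1' always flank the scissile bond.
        window = non_prime.rjust(4, "-")[-4:] + prime.ljust(4, "-")[:4]
        for position, residue in zip(POSITIONS, window):
            if residue != "-":
                counts[position][residue] += 1
    return counts
```

With invented input such as [("DEVD", "GSAT"), ("IETD", "SG"), ("D", "AAAA")], the P1 column counts Asp three times, and a residue never seen at a position simply has a count of zero, which is the advantage of the matrix display noted above.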
Combinatorial substrate analyses
There are techniques for determining peptidase specificity using positional scanning substrate combinatorial libraries comprising either fluorogenic or internally quenched substrates. Analysis of cleaved substrates is often automated, and the actual cleavage positions in individual substrates are unknown. Poreba & Drag (2010) have reviewed the different techniques and are collecting the conclusions from the published literature. These have been made available to MEROPS. When data exist, a table is shown on the peptidase summary page beneath the specificity logos and matrices. Preferences in P4-P4' are shown with amino acids in single letter code plus 'n' for norleucine, 'Abu' for 2-aminobutyric acid, 'Amf' for 4-aminomethyl-Phe and 'Iaf' for 4-aminomethyl-N-isopropyl-Phe. The term 'broad' is shown when no preference exists in a binding pocket. The table also indicates the source organism of the peptidase, whether the peptidase was wild type or recombinant, the optimum substrate identified, the fluorophore or the donor-acceptor pair for an internally quenched substrate, and the reference.