Computational Biology Center
proteinkeys
version 1.5.0 - trunk

Specificity Residues Prediction Analysis Workflow

Specificity Residues Prediction Scheme

The Method

The specificity residues prediction method was originally reported at ISMB-2004 (PowerPoint presentation, 0.7 MB).

The service determines the specificity residues in protein families and interprets them in the context of 3D structures. The prediction method is based on dividing a multiple sequence alignment of a protein family into subfamilies in such a way that the division provides maximum information about functional specificity against the background of overall conservation of general function. The method uses an entropy measure of conservation and variation and solves a complex optimization problem. To achieve the optimal subdivision, the combinatorial entropy of the distribution of residues between subfamilies is computed for each column of the multiple sequence alignment. This entropy is compared to the entropy of a random distribution of residues in the column. The optimal subdivision corresponds to the maximal entropy difference (real minus random), summed over all columns.
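The per-column comparison described above can be sketched in a few lines. This is only an illustration under stated assumptions: the page does not give the entropy formula, so the sketch uses one common combinatorial form, the sum over subfamilies of ln(N_k! / prod_a N_a,k!), and it estimates the random baseline by shuffling the column across sequences. The function names are hypothetical and are not part of the SRP service.

```python
import math
import random
from collections import Counter


def combinatorial_entropy(column, labels):
    """Sum over subfamilies of ln(N_k! / prod_a N_ak!) for one alignment column.

    Assumed formula: the page does not state the exact expression.
    """
    groups = {}
    for residue, label in zip(column, labels):
        groups.setdefault(label, []).append(residue)
    total = 0.0
    for residues in groups.values():
        total += math.lgamma(len(residues) + 1)      # ln(N_k!)
        for count in Counter(residues).values():
            total -= math.lgamma(count + 1)          # minus ln(N_ak!)
    return total


def entropy_difference(column, labels, n_shuffles=200, seed=0):
    """Observed entropy minus the mean entropy of shuffled columns.

    The shuffle baseline is an assumption standing in for the
    'random distribution of residues in the column'.
    """
    rng = random.Random(seed)
    observed = combinatorial_entropy(column, labels)
    shuffled = list(column)
    acc = 0.0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        acc += combinatorial_entropy(shuffled, labels)
    return observed - acc / n_shuffles


labels = [0, 0, 1, 1]                      # two subfamilies of two sequences
conserved = entropy_difference("AAAA", labels)
specific = entropy_difference("AAGG", labels)
```

With this sign convention a fully conserved column gives a difference of zero, and a column whose residues separate cleanly by subfamily gives a negative difference.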

When ordered by entropy difference, the alignment columns (residue positions) fall into three categories: 'specificity', 'neutral' and 'conserved'. The entropies of the 'neutral' columns fit a linear function, while the 'specificity' and 'conserved' groups of columns are characterized by non-linear entropy changes. The 'conserved' columns have entropy differences close to zero, while the 'specificity' columns have the lowest entropy differences. The numbers of 'conserved' and 'specificity' columns can be determined from the entropy curve or specified by the user. The statistical significance of the predictions is determined by computing the probability of obtaining, at random, the observed number of specificity residues in the interface of a given protein complex.
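The significance estimate can be illustrated with a hypergeometric tail probability. This is a sketch under an assumption: the page does not say which null model the server uses, and drawing the specificity positions uniformly at random from all residues is only one natural choice. The function and parameter names are hypothetical.

```python
from math import comb


def interface_enrichment_pvalue(n_total, n_interface, n_specificity, n_overlap):
    """P(at least n_overlap of the specificity residues fall in the interface).

    Assumed null model: the n_specificity positions are drawn uniformly
    at random, without replacement, from the n_total residues.
    """
    denom = comb(n_total, n_specificity)
    upper = min(n_interface, n_specificity)
    tail = sum(
        comb(n_interface, k) * comb(n_total - n_interface, n_specificity - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / denom


# 10 residues, 5 in the interface, 2 predicted specificity residues,
# both found in the interface: C(5,2) / C(10,2) = 10 / 45
p = interface_enrichment_pvalue(10, 5, 2, 2)
```

A small value means that the observed overlap between predicted specificity residues and the interface is unlikely to arise by chance under this model.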

Optimization Parameters

Specificity residues form sequence motifs that are specific to clusters of sequences. To achieve the optimal separation of the sequences of a given alignment into clusters, we recommend repeating the computation with different values of the optimization coefficient A (0 < A < 1). A controls the portion of small clusters in the optimization procedure. By varying A, one can produce essentially different clustering trajectories that favor clusters of different sizes and thereby explore the significant region of the clustering space in search of a sub-optimal solution. A has to be optimized for each particular multiple alignment.

Typically, the optimal value of A falls between 0.65 and 0.85, so the default value of A is 0.75. A user can optimize clustering by trying different values of A. To do this, set three numbers: A_min, A_max and N_steps. Clustering will then be run N_steps times with A_k = A_min + k (A_max - A_min)/N_steps, where k = 0, 1, 2, ..., N_steps - 1.
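As a quick check of the scan formula, the values of A that would be tried can be listed as follows. The function name is hypothetical; only the formula comes from the text above.

```python
def a_values(a_min, a_max, n_steps):
    """Return A_k = A_min + k * (A_max - A_min) / N_steps for k = 0 .. N_steps - 1."""
    return [a_min + k * (a_max - a_min) / n_steps for k in range(n_steps)]


# Scanning the typical range in 4 steps gives roughly 0.65, 0.70, 0.75, 0.80
grid = a_values(0.65, 0.85, 4)
```

Because k stops at N_steps - 1, the largest value tried is one step below A_max. Set A_max one step higher if the upper bound itself should be included.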

How to create a multiple sequence alignment compatible with specificity residues prediction (SRP) analysis:

The SRP service expects a multiple sequence alignment of a single protein family (for multi-domain proteins, a single domain rather than the whole protein). The optimal alignment should be large, dense and diverse.

The preferred way is to submit your sequence to the "sequence search" window of a domain server (Superfamily, PFAM, SMART). The server will predict the domain composition of the sequence and, for each domain found, provide a multiple alignment of homologous sequences.

Using Superfamily to create a multiple sequence alignment suitable for SRP service

Find your protein in Superfamily with a protein name search: type the protein name in the keyword search field on the first page of the Superfamily server. For example, type "p16" and hit enter. Superfamily will find 4 domains present in the protein p16. Choose the last one, named 'Ankyrin repeat'. Click the 'Alignments' link that appears to get the alignment options for this domain. Next, select a reference sequence to align against. The 'Model' '1awc B:' domain is selected by default; you can try others as well. In the section called 'Display Options', set 'Max number of insertions show' to zero, 'Characters per line' to 10000 and 'Maximum number of sequences' to 2000. These settings produce an alignment that includes all known sequences (probably no more than 2000 exist) and is unlikely to be wrapped (proteins are unlikely to be longer than 10000 characters). Scroll down to the list of species and select Homo sapiens. Hit 're-submit'. Copy only the resulting alignment text (note: do not copy the ruler symbols at the top of the alignment, just the proteins). Paste this alignment into the SRP server window.

Filtering your alignment

We recommend adjusting your alignment so that the reference sequence (the query sequence) has no gaps or deletions in the original alignment file. Some editing, in particular removing sequences with gaps, removing unknown residues and removing redundant sequences, can be done on the SRP server using the "Filter your alignment" page. We recommend removing sequences with more than 10-20% gaps. We also recommend removing sequences that are 90-95% or more similar to other sequences in the alignment. Sometimes the reference sequence has long N- and C-termini with gappy columns. In that case, first remove these gappy columns with the "remove column" button, and then adjust the rest of the alignment by removing gappy and redundant sequences.
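For readers who prefer to pre-filter an alignment offline, the two recommendations above can be sketched as follows. This is an illustration, not the server's implementation. It assumes that '-' and '.' mark gaps, that similarity is measured as fractional identity over positions where neither sequence has a gap, and that the reference sequence comes first so that it is always kept. All names are hypothetical.

```python
def gap_fraction(seq, gap_chars="-."):
    """Fraction of positions in an aligned sequence that are gaps."""
    return sum(c in gap_chars for c in seq) / len(seq)


def identity(a, b, gap_chars="-."):
    """Fractional identity over positions where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b)
             if x not in gap_chars and y not in gap_chars]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def filter_alignment(seqs, max_gap=0.2, max_identity=0.95):
    """Drop gappy sequences, then sequences too similar to one already kept.

    seqs is a list of (name, aligned_sequence) pairs, reference first.
    """
    kept = []
    for name, seq in seqs:
        if gap_fraction(seq) > max_gap:
            continue                      # more gaps than allowed
        if any(identity(seq, other) >= max_identity for _, other in kept):
            continue                      # redundant with a kept sequence
        kept.append((name, seq))
    return kept


alignment = [
    ("ref",   "ACDEFGHIKL"),
    ("dup",   "ACDEFGHIKL"),   # identical to ref, removed as redundant
    ("gappy", "A----GHIKL"),   # 40% gaps, removed
    ("div",   "ACDEYGHLKM"),   # 70% identical to ref, kept
]
filtered = filter_alignment(alignment)
```

The thresholds 0.2 and 0.95 correspond to the upper ends of the 10-20% gap and 90-95% similarity ranges recommended above; stricter values remove more sequences.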


Copyright 2006-2008, Computational Biology Center
Memorial Sloan-Kettering Cancer Center