The 10 steps to Transgenic Fold Prediction
The figure below gives an overview of the algorithm and shows the programs involved in each step. The whole procedure has been implemented in an easily parallelizable way, and for CASP4 a cluster of 26 PCs at the CMBI was used to do the calculations. The following points briefly summarize the 10 major steps of the modeling method:
1) Generation of initial alignments:
Output from various programs is collected and added to the "alignment soup":
a) Fold recognition software: GenThreader [1], 3DPSSM [2] and BIOINBGU [3].
b) Classical alignments of the target sequence against the PDB: Smith&Waterman running on Compugen's Bioccelerator, and StretchScan, a tool that searches for small stretches of identical residues.
As the cutoff is set at high e-values (so that data are also obtained for ab initio targets), both methods produce many unreliable alignments deep in the "twilight zone". The output is therefore filtered by SecMatch, a program that extracts, for every aligned PDB sequence, the DSSP secondary structure from our new PDBFINDER II database and compares it to the PSI-PRED [4] prediction for the target sequence. Alignments with severe mismatches in high-confidence regions are discarded.
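The filtering idea can be illustrated with a minimal sketch. This is not SecMatch itself: the function name, the three-state H/E/C encoding, the confidence cutoff and the mismatch fraction are all assumptions made for illustration. Only helix-versus-strand disagreements in high-confidence positions are counted.

```python
from typing import Sequence

def has_severe_mismatch(
    template_ss: str,            # template secondary structure reduced to H/E/C, "-" for gaps
    predicted_ss: str,           # predicted states (H/E/C) for the target, same length
    confidence: Sequence[int],   # prediction confidence per position, 0 (low) to 9 (high)
    min_confidence: int = 7,     # hypothetical cutoff for "high confidence"
    max_mismatch_fraction: float = 0.3,  # hypothetical cutoff for "severe"
) -> bool:
    """Return True if the alignment should be discarded."""
    if not (len(template_ss) == len(predicted_ss) == len(confidence)):
        raise ValueError("inputs must have equal length")
    checked = mismatched = 0
    for t, p, c in zip(template_ss, predicted_ss, confidence):
        if t == "-" or p == "-" or c < min_confidence:
            continue  # skip gaps and low-confidence positions
        checked += 1
        # Only helix-versus-strand counts as a mismatch; coil is tolerated.
        if {t, p} == {"H", "E"}:
            mismatched += 1
    return checked > 0 and mismatched / checked > max_mismatch_fraction
```

With these assumed defaults, an alignment is only rejected when the prediction is confident and contradicts the template, so low-confidence regions never cause a rejection.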
2) Building models for the initial alignments:
Initial models are built by WHAT IF [5] using a protocol described by G.Chinea et al. [6]. As alignments are taken from the "soup" obtained in step 1, different templates and different alignments for the same template are automatically considered here.
3) Side chain debumping:
A short energy minimization is done by YASARA with the backbone kept fixed, so that the protein conformation itself stays untouched while bumps are reduced and residue packing improves.
4) Start of the optimization cycle - model verification:
WHAT_CHECK [7], the WHAT IF module for structure validation, is used to generate a "PDB-report" for each model (reports for the entire PDB are available at www.cmbi.kun.nl/gv/pdbreport).
5) Creation of a residue specific fitness matrix:
The WHAT_CHECK output is converted to a fitness matrix by WHAT_MODELBASE. This matrix currently assigns 16 different quality estimates to every residue, ranging from simple bond length checks to three-dimensional packing quality (fig.2). Weighted averages that give per-residue and per-model scores are included as well.
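As a rough illustration of how per-residue and per-model scores could be derived from such a matrix, the sketch below averages the quality digits of each column. The weights, the scaling of digits to the range 0..1 and the treatment of "?" as missing are assumptions; the actual WHAT_MODELBASE weighting is not given here, and this sketch is not expected to reproduce the averages printed in fig.2.

```python
from typing import Dict, List, Optional

def per_residue_scores(
    matrix: Dict[str, str],      # quality name -> one digit (0-9) or "?" per residue
    weights: Dict[str, float],   # hypothetical weight per quality name (default 1.0)
) -> List[Optional[float]]:
    """Weighted mean of all known quality digits per residue, scaled to 0..1."""
    if not matrix:
        return []
    length = len(next(iter(matrix.values())))
    scores: List[Optional[float]] = []
    for i in range(length):
        total = weight_sum = 0.0
        for name, row in matrix.items():
            if row[i] == "?":
                continue  # no estimate available for this residue
            w = weights.get(name, 1.0)
            total += w * int(row[i]) / 9.0
            weight_sum += w
        scores.append(total / weight_sum if weight_sum else None)
    return scores

def model_score(scores: List[Optional[float]]) -> Optional[float]:
    """Plain mean over all residues that have a score."""
    known = [s for s in scores if s is not None]
    return sum(known) / len(known) if known else None
```

For example, `per_residue_scores({"Bonds": "99", "Torsions": "?0"}, {"Torsions": 3.0})` weights the torsion estimate three times as heavily as the bond estimate for the second residue and ignores the missing torsion value for the first.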
6) Transgenic impression:
YASARA uses the fitness matrix to judge the model. It then generates a genotype that - when expressed - reproduces the good aspects of the phenotype (i.e. the model) but omits the bad ones. At this stage, features acquired during "lifetime" (the molecular dynamics simulations in step 9) can be propagated back to the genome. This genome contains one gene per residue, and each gene consists of up to 256 distance restraints. Every restraint defines a distance between two atoms in the molecule (one of which always belongs to the residue associated with the gene) and a stretching force constant that is derived from the fitness matrix (so that very good features are restrained more tightly). The restraints are clustered in seven atomic mutation units (i.e. mutations always affect the entire unit): backbone, rotamer, packing, disulfide bond, secondary structure, known distance and protonation pattern. The newly generated genome is then added to the gene pool, which is sorted according to model scores.
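The genome layout described above can be sketched as a small data structure. The class and field names, and in particular the linear mapping from quality digit to force constant, are assumptions for illustration; only the limit of 256 restraints per gene and the seven mutation units come from the text.

```python
from dataclasses import dataclass, field
from typing import List

MAX_RESTRAINTS_PER_GENE = 256  # limit stated in the text
MUTATION_UNITS = (
    "backbone", "rotamer", "packing", "disulfide bond",
    "secondary structure", "known distance", "protonation pattern",
)

@dataclass(frozen=True)
class Restraint:
    atom_a: int            # atom belonging to the residue that owns the gene
    atom_b: int            # any other atom in the molecule
    distance: float        # target distance
    force_constant: float  # stretching force constant
    unit: str              # mutation unit this restraint belongs to

@dataclass
class Gene:
    residue: int
    restraints: List[Restraint] = field(default_factory=list)

    def add(self, restraint: Restraint) -> None:
        if restraint.unit not in MUTATION_UNITS:
            raise ValueError(f"unknown mutation unit: {restraint.unit}")
        if len(self.restraints) >= MAX_RESTRAINTS_PER_GENE:
            raise ValueError("gene is full")
        self.restraints.append(restraint)

def force_constant(quality: int, k_max: float = 10.0) -> float:
    """Hypothetical linear mapping: quality digit 0..9 -> 0..k_max."""
    return k_max * quality / 9.0
```

A genome is then simply one `Gene` per residue; a feature with quality 9 is restrained with the full `k_max`, while a feature with quality 0 is effectively left free.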
7) Mating and mutation:
This step is based on standard genetic procedures: from the pool of genomes, two or more (in case they are incomplete due to short initial alignments) are chosen and combined into one descendant by multiple crossovers. The main differences from standard genetics are that crossovers always affect entire genes and that their location and frequency are influenced by the local gene quality (i.e. the residue score). Several types of mutations can be introduced, one for every mutation unit listed in step 6. Mutations might for example change the length and position of secondary structure elements, shift beta strands along each other or change the protonation pattern of histidine residues.
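A quality-guided crossover of two parents might look like the sketch below. It is an illustration under stated assumptions, not the YASARA algorithm: genes are opaque objects (or `None` where a parent has no gene), residue scores lie between 0 and 1, and the probability of switching parents is simply taken to be the score gain offered by the other parent.

```python
import random
from typing import List, Optional, Sequence

def crossover(
    genes_a: Sequence[Optional[object]], scores_a: Sequence[float],
    genes_b: Sequence[Optional[object]], scores_b: Sequence[float],
    rng: random.Random,
) -> List[Optional[object]]:
    """Combine two parent genomes; whole genes are copied, never split."""
    parents = ((genes_a, scores_a), (genes_b, scores_b))
    current = 0  # index of the parent currently being copied
    child: List[Optional[object]] = []
    for i in range(len(genes_a)):
        cur_genes, cur_scores = parents[current]
        oth_genes, oth_scores = parents[1 - current]
        if cur_genes[i] is None and oth_genes[i] is not None:
            current = 1 - current  # forced crossover: current parent is incomplete here
        elif oth_genes[i] is not None:
            gain = oth_scores[i] - cur_scores[i]
            # Hypothetical rule: the larger the quality gain, the likelier the switch.
            if gain > 0 and rng.random() < gain:
                current = 1 - current
        child.append(parents[current][0][i])
    return child
```

Because the switch is decided per gene, two incomplete parents that cover different parts of the sequence are merged automatically, and crossovers cluster where one parent is locally much better than the other.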
8) Gene expression:
The distance information contained in the genes must now be translated to the three-dimensional structure of the "protein child". YASARA uses three different distance geometry approaches (which are borrowed from its NMR module) to achieve this goal: a) simulate a "Virtual Ribosome", b) start with randomized coordinates (both for ab initio folding) and c) take the parent structure as an approximate guess (a trivial method when a template is available). At this stage, insertions and deletions are added to the model.
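To make approach b) more concrete, the sketch below turns a set of distance restraints into coordinates by starting from random positions and repeatedly relaxing each restraint. This is a generic restraint relaxation chosen for brevity, not YASARA's distance geometry code; the function name, the tuple layout and all parameter values are assumptions.

```python
import math
import random
from typing import List, Sequence, Tuple

# (atom i, atom j, target distance, force constant)
Restraint = Tuple[int, int, float, float]

def express(
    n_atoms: int,
    restraints: Sequence[Restraint],
    steps: int = 2000,
    step_size: float = 0.1,  # step_size * force constant should stay well below 0.5
    seed: int = 0,
) -> List[List[float]]:
    """Relax random starting coordinates until the restraints are satisfied."""
    rng = random.Random(seed)
    coords = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(n_atoms)]
    for _ in range(steps):
        for i, j, target, k in restraints:
            delta = [a - b for a, b in zip(coords[i], coords[j])]
            dist = math.sqrt(sum(c * c for c in delta)) or 1e-9
            # Move both atoms along their connecting line, in proportion
            # to the violation and to the force constant.
            shift = step_size * k * (dist - target) / dist
            for axis in range(3):
                coords[i][axis] -= shift * delta[axis]
                coords[j][axis] += shift * delta[axis]
    return coords
```

For a small consistent set such as three atoms with mutual target distances of 1.5, the relaxation settles into a triangle with those edge lengths. Starting instead from the parent coordinates, as in approach c), would only require replacing the random initialization.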
9) "Life" of the molecule - acquiring new abilities and features:
Up to ten molecular dynamics and simulated annealing runs are done for every model, using the all-atom YASARA NOVA97 Force Field, which has been designed for structure optimization in vacuo, based on statistical analysis of known proteins. Off-center point charges improve the modeling of hydrogen bonds, and reduced ionic charges mimic the shielding of these groups by water molecules. As we do not know the true folding force field, every MD or annealing procedure is expected to move the protein at least a bit away from reality. That is why the distance restraints from the genome are part of the simulations: they keep the good parts of the model in place.
10) Gene therapy:
Even though special care is taken by the crossover algorithm to produce a consistent genome without conflicting genes (i.e. genes containing distance restraints that cannot be fulfilled at the same time, as they come from different parents), a small risk always remains. Contradictory restraints cause conformational stress in the protein, which is detected at this stage. The associated genes are located and "cured".
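One simple way to locate stressed genes is to measure how strongly each restraint is violated in the simulated structure. The sketch below does this with a harmonic energy per restraint. How the genes are actually "cured" is not specified here, so dropping restraints above an energy threshold, as well as the threshold itself and all names, are assumptions for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple

# (atom i, atom j, target distance, force constant)
Restraint = Tuple[int, int, float, float]

def restraint_energy(coords: Sequence[Sequence[float]], restraint: Restraint) -> float:
    """Harmonic strain energy of one restraint in the given structure."""
    i, j, target, k = restraint
    return 0.5 * k * (math.dist(coords[i], coords[j]) - target) ** 2

def cure(
    genome: Dict[int, List[Restraint]],  # residue number -> restraints of its gene
    coords: Sequence[Sequence[float]],
    max_energy: float = 1.0,             # hypothetical stress threshold
) -> Dict[int, List[Restraint]]:
    """Return a copy of the genome without restraints that remain strongly violated."""
    return {
        residue: [r for r in restraints if restraint_energy(coords, r) <= max_energy]
        for residue, restraints in genome.items()
    }
```

If a gene inherited two restraints for the same atom pair from different parents, say 3.0 and 8.0, and the simulated distance is 3.0, the second restraint carries all the strain and is the one removed.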
From here the transgenic fold prediction cycle either goes back to step 9 for another MD run or starts again at step 4. The best models can be extracted after step 5; the process continues until convergence is reached.
// 1crn.cdb - WHAT_MODELBASE Quality Check
Sequence   : TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN|
Access     : 4100347612540561477626355017624034295783476263| 0.4285
Quality    : 0.7425
Bonds      : 9899897999989989999998698897979999892979999999| 0.9246
Angles     : 2699996999908089899798899898679996696737996929| 0.7963
Torsions   : ?64644647678687546352666375655764644423755543?| 0.5475
Phi/psi    : ?45545655666665434455646453453654644433754335?| 0.4937
Planarity  : 9999999999999999999999999999999999999999999999| 1.0000
Chirality  : 9899999999999999999999999999999999999999999979| 0.9901
Backbone   : ??999999999999999919799999999996999986999999??| 0.9476
Peptide-Pl : ??9999999999999999?9?9999999999?8999??999999??| 0.9944
Rotamer    : ??559959954767969699?9699899799?9849999799899?| 0.8573
Chi-1/chi-2: 4444955494574644459949693494549546599944994495| 0.6341
Bumps      : 9999909099999999904990999999099999999999999999| 0.7159
Packing 1  : 6489397666677654013238656653014663314314631213| 0.4651
Packing 2  : 9466653456674655444547434744336843142536653532| 0.4934
In/out     : 9999999999999999999999999999999999999999999999| 0.9998
H-Bonds    : 9999979799999999999999999999999999999999999979| 0.9837
Flips      : 9999999999999999999999999999999999999999999999| 1.0000
Fig.2: Residue specific fitness matrix for crambin (1CRN). 16 different quality measures (from "Bonds" to "Flips") are considered by the transgenic folding algorithm. Quality ranges from 0 (bad) through 5 (average/normal) to 9 (perfect). A detailed description of the various quality indicators can be found at http://www.cmbi.kun.nl/gv/checkhelp/
References:
1. D.T.Jones (1999) J. Mol. Biol. 287: 797-815
2. L.A.Kelley, R.M.MacCallum and M.J.E.Sternberg (2000) J. Mol. Biol. 299(2): 501-522
3. D.Fischer (2000) Pacific Symp. Biocomputing, Hawaii, January 2000: 119-130
4. D.T.Jones (1999) J. Mol. Biol. 292: 195-202
5. G.Vriend (1990) J. Mol. Graph. 8: 52-56
6. G.Chinea, G.Padron, R.W.W.Hooft, C.Sander and G.Vriend (1995) Proteins 23(3): 415-421
7. R.W.W.Hooft, G.Vriend, C.Sander and E.E.Abola (1996) Nature 381: 272