protein_pms.pdb - the structure of BPTI in PDB format. The protein has already been protonated and the atom names have been translated to the ProtoMS naming scheme. You can read more about setting up structures here.
In protein_pms.pdb, a box centred at (32, 6.6, 1.7) whose sides have lengths (5.2, 5.0, 8.8) encompasses a small cavity. The aim is to use GCMC to calculate the total affinity of water for that site. We can create a box to these specifications by typing
python2.7 $PROTOMSHOME/tools/make_gcmcbox.py -b 32 6.6 1.7 5.2 5.0 8.8 -o gcmc_box.pdb
The box we've just created and the small cavity in the protein look like:
The figure shows a slice through the surface of the protein, with gcmc_box.pdb encompassing the small, sock-like cavity we're interested in.
As we wish to completely bind water to the volume specified by gcmc_box.pdb, we must run a series of GCMC simulations at different chemical potentials (the Adams value in ProtoMS), within which the average number of inserted waters ranges from 0 to the number of waters that would occur if the subregion were allowed to exchange molecules with bulk water. A good low Adams value to start with is around -35. To estimate the highest Adams value, we'll need the relation
B* = βμ'hyd + ln ‹N*›,
where B* is the Adams value that produces the equilibrium average number of waters ‹N*›, β = 1/kBT, and μ'hyd is the excess chemical potential of bulk water, which is the hydration free energy of a single water molecule. In ProtoMS, previous analysis has found that μ'hyd is approximately -6.2 kcal/mol. Thus, all we need to do is guess a maximum value of ‹N*› to get an upper Adams value. As the region specified by gcmc_box.pdb can accommodate a maximum of about 5 waters, the above equation implies that B* ≤ -8 (assuming a temperature near 298 K, βμ'hyd is about -10.5 and ln 5 is about 1.6).
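As a rough check of this upper bound, the relation can be evaluated numerically. The sketch below assumes that the Adams value is dimensionless, so that μ'hyd is scaled by β = 1/kBT, and that the temperature is 298.15 K. The function name adams_upper_bound is hypothetical and is not part of ProtoMS.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def adams_upper_bound(mu_hyd, n_max, temperature=298.15):
    """Evaluate B* = beta * mu'_hyd + ln(N*) for a guessed maximum occupancy.

    mu_hyd      -- excess chemical potential of bulk water in kcal/mol
    n_max       -- guessed maximum number of waters in the GCMC box
    temperature -- assumed simulation temperature in kelvin
    """
    beta = 1.0 / (K_B * temperature)
    return beta * mu_hyd + math.log(n_max)


# Values quoted in the tutorial: mu'_hyd = -6.2 kcal/mol, about 5 waters.
b_max = adams_upper_bound(mu_hyd=-6.2, n_max=5)
print("Estimated upper Adams value: %.1f" % b_max)
```

With these inputs the estimate comes out at roughly -8.9, which is consistent with taking about -9 as the highest integer Adams value.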
Twenty-four cores were at our disposal for this tutorial. Therefore, we'll run GCMC simulations with Adams values at every integer between and including -32 and -9, which gives one simulation per core. To set up the GCMC simulations, all we need to type is
python2.7 $PROTOMSHOME/protoms.py -s gcmc -sc protein_pms.pdb --gcmcbox gcmc_box.pdb --adamsrange -32 -9
This has automatically solvated our protein in a droplet of water (water.pdb) by randomly placing waters up to bulk density. Any solvent water that was placed inside gcmc_box.pdb has been removed to create water_clr.pdb.
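To run the simulations, type
mpirun -np 24 $PROTOMSHOME/protoms3 run_gcmc.cmd
Once they have finished, you can look at the output of one of the simulations, for example with
pymol out_gcmc/b_-9.000/all.pdb
Check your warning files as well to make sure nothing untoward has happened.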
Before calculating occupancies and free energies with grand canonical integration, we should check to see if the simulations are approximately equilibrated. For one simulation, we can see the average number of inserted GCMC waters for each snapshot by typing
python2.7 $PROTOMSHOME/tools/calc_series.py -f out_gcmc/b_-9.000/results -s solventson
This produces an estimate of where the equilibrated period starts. Check whether that value matches what you see in the graph that is plotted automatically.
For the rest of this analysis, we'll focus on the script calc_gci.py, which contains a lot of functionality. To see how the average number of waters varies with the applied chemical potential - in other words, a titration - type
python2.7 $PROTOMSHOME/tools/calc_gci.py -d out_gcmc/b_-* -p titration
Depending on what you found with calc_series.py, you can discard the first X snapshots of each simulation by adding the flag --skip X. The titration plot should look something like:
The plot shows that the average number of waters at each Adams value occurs in 'steps', which is characteristic of all GCMC titration plots. Unlike the case when GCMC is performed on a cavity that can only bind a single water molecule, the points of inflection of these steps do not necessarily correspond to free energies. As demonstrated in G. A. Ross et al., Journal of the American Chemical Society, 2015, it's actually the area under the titration curve that is related to the free energy to transfer water from ideal gas to the simulated system.
To calculate the area under the titration curve, it is prudent to smooth the data. The script calc_gci.py can fit a curve by modelling the titration data as a sum of logistic functions, which is equivalent to a very simple type of artificial neural network (ANN). As the titration data shows what looks like 2 steps, we can input that into the model. To calculate the fit with 2 steps and plot it, type
python2.7 $PROTOMSHOME/tools/calc_gci.py -d out_gcmc/b_-* -p fit -c fit --steps 2
The result looks like:
The line of best fit is shown in red. The fit correctly captures the shape of the titration data, and looks good except for the plateau at 2 water molecules; the data point at B=-15 seems to have pulled the fitted plateau slightly higher than one would intuitively expect. This results from the fact that the ANN was optimised by minimising the mean squared error, which is notorious for being overly influenced by outliers. We can try to improve the fit by optimising a different "cost" function. The pseudo-Huber cost function puts less weight on outliers at the expense of an additional free parameter, denoted c. To use the pseudo-Huber cost function and to set c=0.1, we type
python2.7 $PROTOMSHOME/tools/calc_gci.py -d out_gcmc/b_-* -p fit -c fit --steps 2 --fit_options "cost huber c 0.1"
The new fit looks like this:
The fit is qualitatively similar to when mean squared error was used as the cost function, but the plateau at ‹N›=2 is more cleanly represented. We'll ascertain whether the different fitting options will quantitatively affect the calculated free energies below.
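To make the fitting ideas concrete, here is a minimal sketch of a sum-of-logistic-functions titration model together with the two cost functions discussed above. It is not the calc_gci.py implementation: the parameterisation (height, slope, midpoint), the function names, and the step parameters in the example are all assumptions chosen for illustration, and the pseudo-Huber loss is written in its standard textbook form.

```python
import math


def logistic_sum(b, steps):
    """Model N(B) as a sum of logistic steps.

    steps is a list of (height, slope, midpoint) tuples, one per step.
    """
    return sum(h / (1.0 + math.exp(-k * (b - b0))) for h, k, b0 in steps)


def mean_squared_error(residuals):
    """Average of squared residuals; grows quadratically with each residual."""
    return sum(r * r for r in residuals) / len(residuals)


def pseudo_huber(residuals, c=0.1):
    """Average pseudo-Huber loss, c^2 * (sqrt(1 + (r/c)^2) - 1).

    Roughly quadratic for |r| << c and roughly linear for |r| >> c,
    so large residuals (outliers) are penalised less than with MSE.
    """
    return sum(c * c * (math.sqrt(1.0 + (r / c) ** 2) - 1.0)
               for r in residuals) / len(residuals)


# Hypothetical two-step model: 2 waters bind near B = -18, a 3rd near B = -11.
example_steps = [(2.0, 1.5, -18.0), (1.0, 1.5, -11.0)]
for b in (-32, -18, -14, -9):
    print("B = %d, N = %.2f" % (b, logistic_sum(b, example_steps)))

# Doubling an outlier's residual quadruples its MSE contribution but only
# about doubles its pseudo-Huber contribution.
mse_ratio = mean_squared_error([2.0]) / mean_squared_error([1.0])
huber_ratio = pseudo_huber([2.0]) / pseudo_huber([1.0])
print("MSE ratio: %.2f, pseudo-Huber ratio: %.2f" % (mse_ratio, huber_ratio))
```

The ratio comparison at the end is the reason a single stray point, such as the one at B=-15, pulls a mean squared error fit further than a pseudo-Huber fit.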
To calculate the binding free energies of adding water to the cavity with the chosen fitted parameters, type
python2.7 $PROTOMSHOME/tools/calc_gci.py -d out_gcmc/b_-* -c pmf --steps 2 --fit_options "cost huber c 0.1"
where -c pmf indicates that the 'potential of mean force', i.e. the free energy, will be calculated. This will bring up a table like this one:
|----------------------IDEAL GAS TRANSFER FREE ENERGIES--------------------| |-BINDING FREE ENERGIES-|
'# Waters'   'Mean'   'Std. dev.'   '25th Percentile'   'Median'   '75th Percentile'   'Mean'   'Median'
      0.00     0.00          0.00                0.00       0.00                0.00     0.00       0.00
      1.00   -16.07          0.15              -16.13     -16.12              -16.10    -9.87      -9.92
      2.00   -31.57          0.15              -31.64     -31.61              -31.59   -19.17     -19.21
      3.00   -40.30          0.18              -40.39     -40.32              -40.30   -21.70     -21.72
The table shows the free energy (in kcal/mol) to transfer water from ideal gas (IDEAL GAS TRANSFER FREE ENERGIES) and from bulk water (BINDING FREE ENERGIES) to the GCMC box. The script calc_gci.py actually fits the ANN several times from different initial parameter values. The free energies are calculated for each fit, and from the ensemble of calculated free energies the mean, standard deviation (Std. dev.), and the 25th, 50th (Median), and 75th percentiles are calculated. When the titration data is particularly noisy, the median free energy is a more robust measure of the average free energy than the mean. The table indicates that the free energy to bind three waters from bulk water is -21.7 +/- 0.18 kcal/mol.
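The two halves of the table appear to be related by the hydration free energy quoted earlier: each mean binding free energy equals the mean ideal gas transfer free energy minus N times μ'hyd, with μ'hyd = -6.2 kcal/mol. The short sketch below checks this against the tabulated means. The function name is hypothetical, and the relation is inferred from the numbers in the table rather than taken from the calc_gci.py source.

```python
MU_HYD = -6.2  # kcal/mol, hydration free energy of one water as quoted above


def binding_free_energy(transfer_free_energy, n_waters, mu_hyd=MU_HYD):
    """Convert an ideal gas transfer free energy into a binding free energy.

    Binding is measured relative to bulk water, so the cost of removing
    n_waters from bulk (n_waters * mu_hyd) is subtracted.
    """
    return transfer_free_energy - n_waters * mu_hyd


# Mean ideal gas transfer free energies copied from the table (kcal/mol).
ideal_gas_means = {1: -16.07, 2: -31.57, 3: -40.30}

binding_means = {}
for n in sorted(ideal_gas_means):
    binding_means[n] = binding_free_energy(ideal_gas_means[n], n)
    print("%d waters: %.2f kcal/mol" % (n, binding_means[n]))
```

The printed values reproduce the 'Mean' column under BINDING FREE ENERGIES, which also explains why the standard deviation is reported only once: subtracting a constant does not change the spread.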
If the ANN had been fitted by minimising the mean squared error, the calculated binding free energy for this example would be -21.9 +/- 0.0 kcal/mol. While this is only 0.2 kcal/mol off the value calculated with the pseudo-Huber cost function, the error estimate (0.0 kcal/mol) is woeful, because it reflects only the spread between repeated fits to the same data. A way to estimate the sensitivity of the free energy to the titration data is to use bootstrap sampling. In each bootstrap sample, the titration data is randomly sampled with replacement, the ANN is re-fitted, and the free energy is calculated.
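The bootstrap procedure can be sketched in a few lines. In the example below, the titration points are made-up illustrative numbers, not results from this tutorial, and a simple trapezoid-rule area under the resampled points is used as a stand-in for re-fitting the ANN and computing the free energy. All function names are hypothetical.

```python
import random


def trapezoid_area(points):
    """Area under (B, N) points by the trapezoid rule, after sorting by B."""
    pts = sorted(points)
    return sum(0.5 * (n0 + n1) * (b1 - b0)
               for (b0, n0), (b1, n1) in zip(pts, pts[1:]))


def bootstrap(points, statistic, n_samples=1000, seed=0):
    """Resample the points with replacement and evaluate the statistic each time."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_samples):
        sample = [rng.choice(points) for _ in points]
        estimates.append(statistic(sample))
    return estimates


def mean_and_std(values):
    """Sample mean and standard deviation of a list of numbers."""
    mean = sum(values) / float(len(values))
    variance = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return mean, variance ** 0.5


# Made-up titration data: (Adams value, average number of waters).
toy_titration = [(-32, 0.0), (-28, 0.0), (-24, 0.1), (-20, 1.0),
                 (-16, 2.0), (-12, 2.6), (-9, 3.0)]

estimates = bootstrap(toy_titration, trapezoid_area, n_samples=1000)
boot_mean, boot_std = mean_and_std(estimates)
print("bootstrap mean = %.2f, std. dev. = %.2f" % (boot_mean, boot_std))
```

The standard deviation across bootstrap samples reflects how much the result depends on which titration points happen to be present, which is the sensitivity that repeated fits to identical data cannot capture.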
You should do as many bootstrap samples as possible, but this can take some time using the default fitting parameters. To speed up the bootstrapping, you can run just 1 random seed for each fit with --fit_options "repeats 1". Also, you can save the ensemble of fitted ANNs using the -o flag. To do 1000 bootstrap samples, plot the fits with error bars, and save the ANNs, type
python2.7 $PROTOMSHOME/tools/calc_gci.py -d out_gcmc/b_-* -p percentiles pmf -c pmf --steps 2 --fit_options "cost huber c 0.1 repeats 1" -b 1000 -o ANNs.pickle
The output, ANNs.pickle, can be read in by calc_gci.py when you want to use the same fitted models again. The ensemble of titration fits is plotted, and the free energy table now reads:
|----------------------IDEAL GAS TRANSFER FREE ENERGIES--------------------| |-BINDING FREE ENERGIES-|
'# Waters'   'Mean'   'Std. dev.'   '25th Percentile'   'Median'   '75th Percentile'   'Mean'   'Median'
      0.00     0.00          0.00                0.00       0.00                0.00     0.00       0.00
      1.00   -16.20          0.55              -16.51     -16.25              -15.69   -10.00     -10.05
      2.00   -31.79          0.88              -32.49     -31.75              -31.08   -19.39     -19.35
      3.00   -40.56          1.03              -41.21     -40.56              -39.91   -21.96     -21.96
The error to transfer/bind 3 waters has now increased to about 1 kcal/mol. The binding free energy of water to gcmc_box.pdb has also been plotted automatically.
To calculate the free energy to add a specific number of waters, say the free energy to bind 1 water when 2 are already bound, use the --range flag. We'll input the 1000 bootstrap fits with the -i flag. We no longer need to specify the fitting options because the models have already been fitted. The binding free energy of the 3rd water can be calculated with
python2.7 $PROTOMSHOME/tools/calc_gci.py -d out_gcmc/b_-* -c pmf -i ANNs.pickle --range 2 3
with the result
|----------------------IDEAL GAS TRANSFER FREE ENERGIES--------------------| |-BINDING FREE ENERGIES-|
'# Waters'   'Mean'   'Std. dev.'   '25th Percentile'   'Median'   '75th Percentile'   'Mean'   'Median'
      2.00     0.00          0.00                0.00       0.00                0.00     0.00       0.00
      3.00    -9.31          0.17               -9.39      -9.31               -9.23    -3.11      -3.11
Note how the uncertainty has significantly decreased. This is because we need only evaluate a smaller area for the relative calculation. The above table shows that the free energy to bind the third and last water is -3.11 +/- 0.17 kcal/mol.