GCMC titration and grand canonical integration

Here, we'll go through the basics of setting up and running a GCMC simulation to calculate the absolute binding free energy of a small network of waters. The example we use is a small binding site in BPTI that can bind 3 water molecules, which has been taken from the recent ProtoMS paper: G. A. Ross et al., J. Am. Chem. Soc., 2015. Although it's a small site, exactly the same procedure can be followed to calculate the binding free energy of a large network.


Prerequisite

The protein has already been protonated and the atom names have been translated to the ProtoMS naming scheme. You can read more about setting up structures here.


Simple setup

Setup
Before running GCMC, you must carefully select the region where you'd like to perform GCMC. During the simulations, waters will be inserted and deleted within that region, and the free energy to add or remove waters can be calculated using grand canonical integration. Therefore, you should make sure the GCMC region covers the volume you're interested in.

In protein_pms.pdb, a box centred at (32, 6.6, 1.7) whose sides have lengths (5.2, 5.0, 8.8) encompasses a small cavity. The aim is to use GCMC to calculate the total affinity of water for that site. We can create a box to these specifications by typing

python2.7 $PROTOMSHOME/tools/make_gcmcbox.py -b 32 6.6 1.7 5.2 5.0 8.8 -o gcmc_box.pdb

The box we've just created and the small cavity in the protein look like:

The figure shows a slice through the surface of the protein, and gcmc_box.pdb encompassing the small, sock-like cavity we're interested in.
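To make the box specification concrete, the short sketch below converts the centre and side lengths given above into the two opposite corners of the box and its volume. It is only an illustration of the geometry; the function names are hypothetical and are not part of make_gcmcbox.py, and the coordinates are assumed to be in Angstroms, as is usual for PDB files.

```python
# Box specification from the tutorial: centre (x, y, z) followed by side lengths.
centre = (32.0, 6.6, 1.7)
lengths = (5.2, 5.0, 8.8)


def box_corners(centre, lengths):
    """Return the minimum and maximum corners of an axis-aligned box."""
    lo = tuple(c - l / 2.0 for c, l in zip(centre, lengths))
    hi = tuple(c + l / 2.0 for c, l in zip(centre, lengths))
    return lo, hi


def in_box(point, centre, lengths):
    """Return True if a point lies inside (or on the surface of) the box."""
    lo, hi = box_corners(centre, lengths)
    return all(a <= p <= b for p, a, b in zip(point, lo, hi))


lo, hi = box_corners(centre, lengths)
volume = lengths[0] * lengths[1] * lengths[2]
print(lo)
print(hi)
print(volume)
```

A helper like in_box can be handy when checking by hand whether a crystallographic water or a residue atom falls inside the region you intend to sample.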

As we wish to calculate the free energy of fully hydrating the volume specified by gcmc_box.pdb, we must run a series of GCMC simulations at different chemical potentials (expressed as Adams values in ProtoMS), over which the average number of inserted waters ranges from 0 to the number of waters that would occur if the subregion were allowed to exchange molecules with bulk water. A good low Adams value to start with is around -35. To estimate the highest Adams value, we'll need the relation

B* = βμ'hyd + ln ‹N*›,

where B* is the Adams value that produces the equilibrium average number of waters ‹N*›, β = 1/kBT converts the chemical potential into the dimensionless units of the Adams value, and μ'hyd is the excess chemical potential of bulk water, which is the hydration free energy of a single water molecule. In ProtoMS, previous analysis has found that μ'hyd is approximately -6.2 kcal/mol. Thus, all we need to do is guess a maximum value of ‹N*› to get an upper Adams value. As the region specified by gcmc_box.pdb can accommodate a maximum of about 5 waters, the above equation gives B* of roughly -8.9 at around 298 K, which implies that B* ≤ -8.
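The estimate of the upper Adams value can be checked with a few lines of Python. This is a minimal sketch that assumes μ'hyd must first be divided by kBT so that both terms are dimensionless, and that the temperature is 298.15 K (the tutorial does not state the simulation temperature). The function name adams_upper is hypothetical.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def adams_upper(mu_hyd, n_max, temperature=298.15):
    """Adams value at which about n_max waters are expected at equilibrium.

    mu_hyd is the excess chemical potential of bulk water in kcal/mol.
    The temperature default is an assumption, not a value from the tutorial.
    """
    beta = 1.0 / (K_B * temperature)
    return beta * mu_hyd + math.log(n_max)


b_upper = adams_upper(-6.2, 5)
print(round(b_upper, 2))
```

Under these assumptions the result lies between -9 and -8, in line with the bound quoted above.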

Twenty-four cores were at our disposal for this tutorial. Therefore, we'll run GCMC simulations with Adams values at every integer between and including -32 and -9. To set up the GCMC simulations, all we need to type is

python2.7 $PROTOMSHOME/protoms.py -s gcmc -sc protein_pms.pdb --gcmcbox gcmc_box.pdb --adamsrange -32 -9
This has automatically solvated our protein in a droplet of water (water.pdb) by randomly placing waters up to bulk density. Any solvent water that was placed inside gcmc_box.pdb has been removed to create water_clr.pdb.
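As a quick sanity check on the core count, the sketch below lists the integer Adams values requested with --adamsrange -32 -9. The output directory pattern is an assumption extrapolated from the path out_gcmc/b_-9.000 that appears in this tutorial; check your own output folder for the actual names.

```python
# Every integer Adams value from -32 to -9 inclusive.
adams_values = list(range(-32, -8))

# Assumed naming pattern, extrapolated from "out_gcmc/b_-9.000".
output_dirs = ["out_gcmc/b_%.3f" % b for b in adams_values]

print(len(adams_values))
print(output_dirs[0])
print(output_dirs[-1])
```

There is one Adams value per core, which is why 24 cores are needed for this set-up.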
Execution
To run the simulation, you need 24 cores and MPI. Execute by typing
mpirun -np 24 $PROTOMSHOME/protoms3 run_gcmc.cmd

Analysis
It is vital that you check the simulations by eye. If you have PyMOL, you can check the output structures from one of the GCMC simulations by typing
 pymol out_gcmc/b_-9.000/all.pdb 
Check your warning files as well to make sure nothing untoward has happened.

Before calculating occupancies and free energies with grand canonical integration, we should check whether the simulations are approximately equilibrated. For one simulation, we can see the number of inserted GCMC waters in each snapshot by typing

python2.7 $PROTOMSHOME/tools/calc_series.py -f out_gcmc/b_-9.000/results -s solventson 
This also produces an estimate for the start of the equilibrated period. Check whether that value matches what you see in the graph that gets automatically plotted.
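The reason the equilibration point matters is that averages taken over the whole run are biased by the early snapshots. The sketch below illustrates this with a made-up series of water counts; the numbers are purely illustrative and are not output from this tutorial, and mean_after_skip is a hypothetical helper rather than part of calc_series.py.

```python
def mean_after_skip(series, skip):
    """Average a series after discarding the first `skip` snapshots."""
    kept = series[skip:]
    if not kept:
        raise ValueError("skip discards every snapshot")
    return sum(kept) / float(len(kept))


# Made-up number of inserted waters per snapshot (illustrative only).
solventson = [0, 0, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3]

full_average = mean_after_skip(solventson, 0)
equilibrated_average = mean_after_skip(solventson, 6)
print(full_average)
print(equilibrated_average)
```

In this toy series the occupancy only settles after the sixth snapshot, so including the early snapshots underestimates the equilibrium average.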

For the rest of this analysis, we'll focus on the script calc_gci.py which contains a lot of functionality. To see how the average number of waters varies with the applied chemical potential - in other words, a titration - type

python2.7 $PROTOMSHOME/tools/calc_gci.py -d out_gcmc/b_-* -p titration 
Depending on what you found with calc_series.py, you can discard the first X snapshots of each simulation by adding the flag --skip X. The titration plot should look something like

The plot shows that the average number of waters increases with the Adams value in 'steps', which is characteristic of all GCMC titration plots. Unlike the case when GCMC is performed on a cavity that can only bind a single water molecule, the points of inflection of these steps do not necessarily correspond to free energies. As demonstrated in G. A. Ross et al., J. Am. Chem. Soc., 2015, it's actually the area under the titration curve that is related to the free energy to transfer water from ideal gas to the simulated system.
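To give a feel for what "area under the titration curve" means numerically, here is a minimal sketch that integrates a set of titration points with the trapezoidal rule. The data are made up for illustration, and the sketch computes only the area; converting that area into a free energy follows the expression in the cited paper, which is not reproduced here.

```python
def trapezoid_area(b_values, n_values):
    """Area under N(B) by the trapezoidal rule, after sorting by B."""
    pairs = sorted(zip(b_values, n_values))
    area = 0.0
    for (b0, n0), (b1, n1) in zip(pairs, pairs[1:]):
        area += 0.5 * (n0 + n1) * (b1 - b0)
    return area


# Made-up titration points: Adams values and average water counts.
adams = [-20, -18, -16, -14, -12, -10]
waters = [0.0, 0.5, 1.0, 1.0, 2.0, 3.0]

area = trapezoid_area(adams, waters)
print(area)
```

Integrating raw points like this is sensitive to noise in individual simulations, which is one motivation for smoothing the data first.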

To calculate the area under the titration curve, it is prudent to smooth over the data. The script calc_gci.py can fit a curve by modelling the titration data as a sum of logistic functions, which is equivalent to a very simple type of artificial neural network (ANN). As the titration data shows what looks like 2 steps, we can input that into the model. To calculate the fit with 2 steps and plot it, type
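A sum of logistic functions can be written down in a few lines. The sketch below is one possible parametrisation, with each step described by a height, a slope and a midpoint; the exact parametrisation used inside calc_gci.py may differ, and the parameter values here are hypothetical rather than fitted.

```python
import math


def logistic(x):
    """Numerically safe logistic function 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def titration_model(b, steps):
    """Average water count at Adams value b.

    steps is a list of (height, slope, midpoint) tuples, one per step.
    """
    return sum(h * logistic(s * (b - m)) for h, s, m in steps)


# Hypothetical two-step model: a step of 2 waters, then a step of 1 water.
steps = [(2.0, 3.0, -17.0), (1.0, 3.0, -11.0)]

for b in (-32, -17, -14, -5):
    print(b, round(titration_model(b, steps), 3))
```

Each logistic term contributes one plateau, so the number of steps passed to the model sets how many plateaus the fitted curve can show.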

python2.7 $PROTOMSHOME/tools/calc_gci.py -d out_gcmc/b_-* -p fit -c fit --steps 2 
The result looks like

The line of best fit is shown in red. The fit correctly captures the shape of the titration data, and looks good except for the plateau at 2 water molecules; the data point at B=-15 seems to have pulled the fitted plateau slightly higher than one would intuitively expect. This results from the fact that the ANN was optimised by minimising the mean squared error, which is notorious for being overly influenced by outliers. We can try to improve the fit by optimising a different "cost" function. The pseudo-Huber cost function puts less weight on outliers at the expense of an additional free parameter, denoted c. To use the pseudo-Huber cost function and to set c=0.1, we type

python2.7 $PROTOMSHOME/tools/calc_gci.py -d out_gcmc/b_-* -p fit -c fit --steps 2  --fit_options "cost huber c 0.1" 
The new fit looks like this

The fit is qualitatively similar to when mean squared error was used as the cost function, but the plateau at ‹N›=2 is more cleanly represented. We'll ascertain whether the different fitting options will quantitatively affect the calculated free energies below.
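To see why the pseudo-Huber cost is less sensitive to outliers, the sketch below compares it with the squared error for a small and a large residual. It uses the standard textbook form of the pseudo-Huber function; whether calc_gci.py uses exactly this scaling is an assumption.

```python
import math


def squared_error(residual):
    """Squared-error cost of a single residual."""
    return residual ** 2


def pseudo_huber(residual, c):
    """Standard pseudo-Huber cost: quadratic near zero, linear far away."""
    return c ** 2 * (math.sqrt(1.0 + (residual / c) ** 2) - 1.0)


c = 0.1
for residual in (0.01, 1.0):
    print(residual, squared_error(residual), pseudo_huber(residual, c))
```

For residuals much smaller than c the cost behaves like half the squared error, while for residuals much larger than c it grows only linearly, so a single stray data point pulls the fit far less.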

To calculate the binding free energies of adding water to the cavity with the chosen fitted parameters, type

python2.7 $PROTOMSHOME/tools/calc_gci.py -d out_gcmc/b_-* -c pmf --steps 2  --fit_options "cost huber c 0.1" 
where -c pmf indicates that the 'potential of mean force', i.e. the free energy, will be calculated. This will bring up a table like this one:
          |----------------------IDEAL GAS TRANSFER FREE ENERGIES--------------------|   |-BINDING FREE ENERGIES-|
'# Waters' 'Mean'  'Std. dev.'  '25th Percentile'       'Median'      '75th Percentile'    'Mean'        'Median'
  0.00      0.00      0.00            0.00                0.00               0.00            0.00           0.00
  1.00    -16.07      0.15          -16.13              -16.12             -16.10           -9.87          -9.92
  2.00    -31.57      0.15          -31.64              -31.61             -31.59          -19.17         -19.21
  3.00    -40.30      0.18          -40.39              -40.32             -40.30          -21.70         -21.72
The table shows the free energy (in kcal/mol) to transfer water from ideal gas (IDEAL GAS TRANSFER FREE ENERGIES) and from bulk water (BINDING FREE ENERGIES) to the GCMC box. The script calc_gci.py actually fits the ANN several times from different initial parameter values. The free energies are calculated for each fit, and from the ensemble of the calculated free energies the mean, standard deviation (Std. dev), and the 25th, 50th (Median), and 75th percentiles are calculated. When the titration data is particularly noisy, the median free energy is a more robust measure of the average free energy than the mean. The table indicates that the free energy to bind three waters from bulk water is -21.7 +/- 0.18 kcal/mol.
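The two halves of the table are linked by a simple piece of arithmetic. The numbers above are consistent with subtracting N times the hydration free energy of water (about -6.2 kcal/mol, as quoted earlier) from the ideal gas transfer free energy. This relation is inferred from the tabulated values rather than taken from the calc_gci.py source, so treat the sketch below as a consistency check only.

```python
MU_HYD = -6.2  # hydration free energy of water in kcal/mol, from the tutorial


def binding_from_transfer(transfer, n_waters, mu_hyd=MU_HYD):
    """Binding free energy from bulk, given the transfer free energy from ideal gas."""
    return transfer - n_waters * mu_hyd


# Mean ideal gas transfer free energies from the table above (kcal/mol).
transfer_means = {1: -16.07, 2: -31.57, 3: -40.30}

binding_means = {}
for n_waters in sorted(transfer_means):
    binding_means[n_waters] = binding_from_transfer(transfer_means[n_waters], n_waters)
    print(n_waters, round(binding_means[n_waters], 2))
```

The computed values match the 'Mean' column under BINDING FREE ENERGIES to within rounding.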

If the ANN had instead been fitted by minimising the mean squared error, the calculated binding free energy for this example would be -21.9 +/- 0.0 kcal/mol. While this is only 0.2 kcal/mol from the value calculated with the pseudo-Huber cost function, the error estimate (0.0 kcal/mol) is woeful, because it only reflects the spread between repeated fits to the same data. A way to estimate the sensitivity of the free energy to the titration data is to use bootstrap sampling. In each bootstrap sample, the titration data is randomly sampled with replacement, the ANN is re-fitted, and the free energy is calculated.
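The bootstrap idea itself is independent of the fitting model. The sketch below resamples made-up titration points with replacement and recomputes a statistic for each resample. As a stand-in for re-fitting the ANN and calculating a free energy, it uses the trapezoidal area under the resampled points; the data, the function names and the simple index-based percentile are all illustrative assumptions, not the calc_gci.py implementation.

```python
import random


def trapezoid_area(points):
    """Area under (B, N) points by the trapezoidal rule, after sorting by B."""
    ordered = sorted(points)
    return sum(0.5 * (n0 + n1) * (b1 - b0)
               for (b0, n0), (b1, n1) in zip(ordered, ordered[1:]))


def bootstrap(points, statistic, n_boot, seed=0):
    """Recompute `statistic` on n_boot resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(points)
    results = []
    for _ in range(n_boot):
        resample = [points[rng.randrange(n)] for _ in range(n)]
        results.append(statistic(resample))
    return results


def percentile(values, q):
    """Simple index-based percentile without interpolation."""
    ordered = sorted(values)
    return ordered[int(q / 100.0 * (len(ordered) - 1))]


# Made-up titration points (Adams value, average water count).
points = [(-20, 0.0), (-18, 0.5), (-16, 1.0), (-14, 1.0), (-12, 2.0), (-10, 3.0)]

areas = bootstrap(points, trapezoid_area, 1000)
p25, p50, p75 = (percentile(areas, q) for q in (25, 50, 75))
print(p25, p50, p75)
```

The spread between the percentiles gives a sense of how much the result depends on which simulations happen to be included, which is the quantity the fit-to-fit standard deviation fails to capture.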

You should do as many bootstrap samples as possible, but this can take some time using the default fitting parameters. To speed up the bootstrapping, you can run just 1 random seed for each fit by adding repeats 1 to the --fit_options string. Also, you can save the ensemble of fitted ANNs using the -o flag. To do 1000 bootstrap samples, plot the fits with error bars, and save the ANNs, type

python2.7 $PROTOMSHOME/tools/calc_gci.py -d out_gcmc/b_-* -p percentiles pmf -c pmf --steps 2  --fit_options "cost huber c 0.1 repeats 1" -b 1000 -o ANNs.pickle 
The output, ANNs.pickle, can be read in by calc_gci.py when you want to use the same fitted models again. The ensemble of titration fits looks like

The orange and grey areas indicate where 50% and 90%, respectively, of the bootstrapped fits lie. The error can be reduced by running more GCMC simulations, particularly with Adams values around the points of inflection. The table of free energies that corresponds to the bootstrap sampling looks like
          |----------------------IDEAL GAS TRANSFER FREE ENERGIES--------------------|   |-BINDING FREE ENERGIES-|
'# Waters' 'Mean'  'Std. dev.'  '25th Percentile'       'Median'      '75th Percentile'    'Mean'        'Median'
  0.00      0.00      0.00            0.00                0.00               0.00            0.00           0.00
  1.00    -16.20      0.55          -16.51              -16.25             -15.69          -10.00         -10.05
  2.00    -31.79      0.88          -32.49              -31.75             -31.08          -19.39         -19.35
  3.00    -40.56      1.03          -41.21              -40.56             -39.91          -21.96         -21.96
The error to transfer/bind 3 waters has now increased to about 1 kcal/mol. The binding free energy has been automatically plotted and looks like

The blue and grey areas show the 50% and 90% confidence intervals, respectively, of the calculated free energies. From this graph, it's clear that the minimum free energy state of the system is with three waters bound. Thus, three is the optimal number of waters for the region encompassed by gcmc_box.pdb.

To calculate the free energy to add a specific number of waters, say the free energy to bind 1 water when 2 are already bound, use the --range flag. We'll input the 1000 bootstrap fits with the -i flag. We no longer need to specify the fitting options because the models have already been fitted. The binding free energy of the 3rd water can be calculated with

python2.7 $PROTOMSHOME/tools/calc_gci.py -d out_gcmc/b_-* -c pmf -i ANNs.pickle --range 2 3 
with the result
          |----------------------IDEAL GAS TRANSFER FREE ENERGIES--------------------|   |-BINDING FREE ENERGIES-|
'# Waters' 'Mean'  'Std. dev.'  '25th Percentile'       'Median'      '75th Percentile'    'Mean'        'Median'
  2.00      0.00      0.00            0.00                0.00               0.00            0.00           0.00
  3.00     -9.31      0.17           -9.39               -9.31              -9.23           -3.11          -3.11
Note how the uncertainty has significantly decreased. This is because we need only evaluate a smaller area for the relative calculation. The above table shows that the free energy to bind the third and last water is -3.11 +/- 0.17 kcal/mol.

Written by Gregory A. Ross, 2015.