GCMC binding free energy of a water molecule to BPTI

This tutorial will cover the basics behind setting-up, running and analysing GCMC simulations that calculate the binding free energy of a single water molecule to a small cavity in bovine pancreatic trypsin inhibitor (BPTI). BPTI is the first protein ever to be simulated using molecular dynamics, which makes it a very worthy model system. This tutorial is based on some of the work that appears in G. A. Ross et. al, J. Am. Chem. Soc., 2015. Should you find this tutorial useful in your research, please cite it.

To come back to the tutorial index, click here.

Prerequisites

The protein has already been protonated and had it's atoms in the ProtoMS naming scheme. Note that there are 3 disulfied bonds in the protein, which are labelled as CYX. The water molecule is the Tip4p model, and is already located in the BPTI cavity. You can read more about setting up structures here.

We will be performing GCMC on the single water molecule, wat.pdb, at a range of different chemical potentials. During these GCMC simulations, the water will be instantaneously switched "on" and "off", and from the average occupancy of the water as a function of chemical potential, we will estimate it's binding free energy. As you'll see below, we actually use something called the "Adams" value rather than the chemical potential. This is for technical reasons, and the two are related to each other by multiplicativen and additive constants.


Simple setup

Setup
First, let's create a box around our water of interest. This will constrain the GCMC water to the cavity and stop any solvent water molecules from entering.

python2.7 $PROTOMSHOME/tools/make_gcmcbox.py -s wat.pdb

This has created a file called gcmc_box.pdb. Next, we'll use the automatic capabilities of protoms.py to do the rest of the set-up.

python2.7 $PROTOMSHOME/protoms.py -sc protein_pms.pdb -s gcmc --gcmcwater wat.pdb --gcmcbox gcmc_box.pdb --adams -11 -12 -13 -14 -15 -16 -17 -18 -19 -20 -21 -22 -23 -24 -25 -26 --capradius 26

We've input which Adams values (i.e. chemical potentials) we will use with --adams. These can be chosen after some preliminary analysis that we'll come to later. As the protein was already in the ProtoMS format, we gave it the flag -sc (for scoop). We also specified the radius of the water droplet with --capradius; the default radius in protoms.py is 30 Angstroms, which is unnecessary for such a small protein.

Have a look at the simulation system we've created with your favourite molecular viewer. For instance, with vmd:

vmd -m protein_pms.pdb water.pdb wat.pdb gcmc_box.pdb

You should see something like this

While your visualising the system, have a look at the cystein bridges. These need to be restrained in ProtoMS which protoms.py will do automatically when preparing proteins. To see that protein_pms.pdb specifies which residues to restrain, open it up and see

REMARK chunk fixresidues 1 5 14 30 38 51 55
REMARK chunk fixbackbone 1 4 6 13 15 29 31 37 39 50 52 54 56
This fixes the cystein residues, as well as the neighbouring residues. If the above isn't included, the cysteine bonds will break due to a quirk in the backbone sampling of ProtoMS.

Execution
Notice that in the command file run_gcmc.cmd file there is the line
multigcmc -11.000 -12.000 -13.000 -14.000 -15.000 -16.000 -17.000 -18.000 -19.000 -20.000 -21.000 -22.000 -23.000 -24.000 -25.000 -26.000
These sixteen Adams values mean that sixteen cores will be needed to run these simulations. ProtoMS is designed to run mpi, so to execute, enter
mpirun -np 16 $PROTOMSHOME/protoms3 run_gcmc.cmd
This will take approximately five to six hours to run. The length of time and number of jobs means that it is more convenient to run on a computer cluster than your work-station.

Analysis
Visually inspect your simulations in VMD or your favourite viewer and make sure nothing untoward has happened. Unlike the viewer Pymol, VMD has a problem with simulations that change the total number of molecules. To view a movie of the simulation, we must remove the water molecules from the simulation. In one of the folders within out, try
grep -v WA1 all.pdb > all_nowat.pdb
vmd all_nowat.pdb
and make sure the simulation looks okay.

We can look at the occupancy of the water molecule in each of the simulations. At less negative Adams values, the water has a greater probability to be inserted for the majority of the simulation. In simulations with low Adams values, the water may completely vacant from the cavity. Intermediate Adams values will produce a large number of insertions and deletions. Lets have a look at one such intermediate value. For instance, type

python2.7 $PROTOMSHOME/tools/calc_series.py -f out_gcmc/b_-19.000/results -s solventson
You may see something like this

Have a look at the corresponding plots for the other simulations.

As demonstrated in G. A. Ross et. al, Journal of the American Chemical Society, 2015, the relationship between the Adams value and the average occupancy of the GCMC region can be used to calculate the free energy to transfer the water molecule from ideal gas to the GCMC region via

N› = 1/(1 + exp(βΔFtrans - B)),

where ‹N› is the average number of water molecules, B is the Adams value, ΔFtrans is the free energy to transfer the water molecule from ideal gas, and β is 1/kBT, with kB denoting Boltzmann's constant, and T the temperature. This equation - a logistic equation - takes the form of a smooth step function, with the point of inflection given by ΔFtrans. While the transfer free energy can be estimated by eye, the most accurate way to calculate ΔFtrans is to fit the logistic equation to the titration data, as ΔFtrans is the only unknown in the equation. The transfer free energy can be calculated by typing

python2.7 $PROTOMSHOME/tools/calc_gcsingle.py -d out_gcmc/b_-* -p
where the -p flag specifies that titration data and fitted logistic curve will be plotted. The script calc_gcsingle.py estimates the error in the calculated free energy by performing 1000 bootstrap samples of the titration data and refitting the logistic equation each time. The printed results should look something like:
FREE ENERGY ESTIMATES:
  Least squares estimate = -10.57 kcal/mol
  Bootstrap estimate     = -10.57 +/- 0.19 kcal/mol
The first result, Least squares estimate, is simply the least squares fit of the transfer free energy, and has no error associated with it. The second result, Bootstrap estimate is the estimate that has been calculated via bootstrap sampling. The value quoted is the mean transfer free energy from all bootstrap samples, and the error is the standard deviation from those samples. To get the binding free energy of water for the GCMC region, simply take the difference of the transfer free energy and hydration free energy of water. ProtoMS calculates the hydration free energy of water to be about -6.2 kcal/mol, so the binding energy is roughly -10.6 + 6.2 = -4.4 kcal/mol.

Two plots will have been produced when calling calc_gcsingle.py. Like the results, one shows the least squares fit of the logistic function, and the second shows the results of the bootstrap sampling and fitting, and is probably more informative. For the latter, you may see something like

The red line marks the median of the all the bootstrap fits, the orange and grey areas contain 50% and 90% of all the fits respectively. To be clear, these errors illustrate the uncertainty of the fit, not the GCMC titration data. The error produced from the bootstrap sampling can be reduced with data more GCMC simulations, particularly at Adams values near the point of inflection.

Estimating the excess chemical potential
Each simulation at a different Adams value can be used to estimate the excess chemical potential of the water molecule in the protein cavity via the equation

B = μex + ln ‹N›,

where B is the Adams value, ‹N› is the average number of water molecules, and μex is the excess chemical potential. As described in J. Chem. Inf. Model., Vol. 54, pg. 1623, it is possible to use an estimate of μex to approximate ΔFtrans. This approximation is only valid when ‹N› approches zero. Unfortunately, at low Adams values - when the occupancy of the water is very low - the data becomes noisy, so that the recomended way to estimate ΔFtrans is via fitting the logistic function as described above. We can start estimating the excess chemical potential at each Adams value by typing

python2.7 $PROTOMSHOME/tools/calc_gcsingle.py -d out_gcmc/b_-* --excess
which brings up a plot for the excess chemical potential. As you can see, the line appears to become flatter but noiser at low Adams values. It is in this regime that one can approximate ΔFtrans. You will then be prompted to input the range of Adams values from which the excess chemical potential will be estimated. You should chose the range over which the estimates for excess chemical appears constant. For instance, for the plot

One could enter
 -17 -24 
as the line seems flatest between these values. The excess chemical potential is averaged over this range, which produces our estimate for the coupling free energy. You should predict ΔFtrans to be about -10.5 kcal/mol.

Exploring more options

Here, we'll consider some points that will help with the set-up and execution of GCMC on a single water molecule.

Setting the Adams values
The Adams parameter affects the probability to accept and insert or deletion move. Lowering the Adams value makes it more likely to empty the GCMC simulation region, while raising the Adams value increases the occupancy of the cavity. If only one value is entered, for instance
python2.7 $PROTOMSHOME/protoms.py -sc protein_pms.pdb -s gcmc --gcmcwater wat.pdb --gcmcbox gcmc_box.pdb --adams -11 
the command file run_gcmc.cmd will have the line
potential -11.000
On the other hand, if one were to enter multiple Adams values, for instance
python2.7 $PROTOMSHOME/protoms.py -sc protein_pms.pdb -s gcmc --gcmcwater wat.pdb --gcmcbox gcmc_box.pdb --adams -11 -12
one would find this line in run_gcmc.cmd:
multigcmc -11.000 -12.000 
The difference between these two commands determines whether one needs to run ProtoMS with MPI. For one Adams value, ProtoMS is executed with
$PROTOMSHOME/protoms3 run_gcmc.cmd
With more than one value, ProtoMS is executed with
mpirun -np 16 $PROTOMSHOME/protoms3 run_gcmc.cmd

were 16 would typically be the number of b values in the simulation. When choosing which Adams values to use in ones simulations, note βΔFtrans is the point of half maximum of the logistic curve. Reliable transfer free energy estimates can be achieved with the Adams values around this point.

start by simulating with values between -16 and 0 and analysing the results. You can choose more Adams values to simulate with if you didn't get reliable free energy estimates from the runs you already have.

Running independent repeats
Multiple independant repeats are essential for well resolved free energy estimates. Every GCMC simulation counts as an individual repeat, and one can always run additional simulations at different Adams values. If you're sure of the Adams value(s) from the start and want, say, 3 repeats, typing
python2.7 $PROTOMSHOME/protoms.py -sc protein_pms.pdb -s gcmc --gcmcwater wat.pdb --gcmcbox gcmc_box.pdb --adams -19 -20 -21 -r 3
will create 3 command files called run1_bnd.cmd, run2_bnd.cmd, and run3_bnd.cmd, which, when executed, create the output folders out_gcmc1, out_gcmc2, and out_gcmc3. Analysing the results from multiple directories is trivial; all you need to do is enter the directories that contain the ProtoMS results files. For instance, one can type
python2.7 $PROTOMSHOME/tools/calc_gcsingle.py -d out_gcmc1/b_-* out_gcmc2/b_-* out_gcmc3/b_-*
to analyse all the simulation data in one go.
Setting the move proportions for GCMC
The default behaviour in protoms.py when assigning move proportions is to dedicate half of all trial moves to grand canonical solute moves. A typical run_gcmc.cmd created by protoms.py for BPTI will contains the lines
chunk equilibrate 5000000 solvent=0 protein=0 solute=0 insertion=333 deletion=333 gcsolute=333
chunk equilibrate 5000000 solvent=440 protein=60 solute=0 insertion=167 deletion=167 gcsolute=167
chunk simulate 40000000 solvent=440 protein=60 solute=0 insertion=167 deletion=167 gcsolute=167
Note how the numbers to the right of each "=" sign roughly add up to 1000.The terms insertion, deletion, and gcsolute are specific to GCMC. Let's look at the last two chunks. As 167+167+167≈500, the proportion of moves dedicated to GCMC is 500/1000=1/2. This default behaviour was designed for cases when the GCMC region would contain tens of water molecules, and not just one water molecule as we have with BPTI. By trailing too many moves on the single water molecule in the BPTI case, we run the risk of undersampling the rest of the system with respect to the water molecule. While that isn't the case with BPTI, it may be for other systems, so it's worth experimenting with different move proportions. Altering the last two chunks to read
chunk equilibrate 5000000 solvent=440 protein=60 solute=0 insertion=50 deletion=50 gcsolute=50
chunk simulate 40000000 solvent=440 protein=60 solute=0 insertion=50 deletion=50 gcsolute=50
means that one fifth of all moves will be dedicated to sampling the grand canonical water molecule. The free energy calculated from such simulations is not significantly different to what we calculated before for BPTI, which lends confidense in our estimate.


Files created by the setup script

Simulation specific files
Running make_gcmcbox.py and protoms.py created the following files: If there were any molecules from water.pdb that were within the GCMC box, they would be removed and the file water_clr.pdb will be created as well.


Setting up the simulation with individual tools

Here, we'll go through through the set-up of the ProtoMS files step by step, we'll also d without the help of the automatisation in protoms.py. Doing so will grant us some extra flexiblity.
Setting up the GCMC box
Before running GCMC, you have to be very clear what region you want to allow GCMC moves. In the example BPTI example at the top of the page, we used make_gcmcbox.py to delineate the volume around the structure wat.pdb. If you don't have such a structure to draw a box around, you can also specify the size and centre of the box instead. For instance,

python2.7 $PROTOMSHOME/tools/make_gcmcbox.py -b 32.67 4.32 10.34 -p 2 -o gcmc_box.pdb 

will create gcmc_box.pdb centered around x=32.67, y=4.32 and x=10.34, with 2 Angstroms either side of the centre such that the box has a volume 4×4×4 Angstroms cubed. Alternatively, one can create the same box by specifying

python2.7 $PROTOMSHOME/tools/make_gcmcbox.py -b 32.67 4.32 10.34 4 4 4 -o gcmc_box.pdb 

Here, we've explicitly stated the size of the box in each direction, meaning we could create a cuboid by giving different box lengths in each direction, as opposed to the cube with the -p flag.

Solvating the system
To solvate the protein with a sphere of water, up to a radius of 26 Angstroms, one can enter the following:

python2.7 $PROTOMSHOME/tools/solvate.py -pr protein_pms.pdb -g droplet -r 26 -b $PROTOMSHOME/data/wbox_tip4p.pdb -o water.pdb

This makes water.pdb. We need to specify the water molecules that we'll be sampling with GCMC. We'll use solvate.py again to fill the volume specified by gcmc_box.pdb.

python2.7 $PROTOMSHOME/tools/solvate.py -pr gcmc_box.pdb -g flood -b $PROTOMSHOME/data/wbox_tip4p.pdb -o gcmc_wat.pdb

with output gcmc_wat.pdb. When running Monte Carlo, ProtoMS will not allow any bulk water molecules (specified in water.pdb) to enter or leave the dimentions specified in gcmc_box.pdb. So that we don't block the movement of the grand canonical water molcules (specified in gcmc_wat.pdb), we need to remove any block water molecules that are with the box:

python2.7 $PROTOMSHOME/tools/clear_gcmcbox.py -b gcmc_box.pdb -s water.pdb -o water_clr.pdb

In the BPTI test case, no water molcules in water.pdb are within the GCMC region, so no waters are cleared and, in this case, we don't need the created file water_clr.pdb.

Making the ProtoMS input file
The make the input file for the Monte Carlo simulation, we reference the files created thus far

python2.7 $PROTOMSHOME/tools/generate_input.py -p protein_pms.pdb -pw water.pdb --gcmcwater gcmc_wat.pdb --gcmcbox gcmc_box.pdb --adams -11 -12 -13 -14 -15 -16 -17 -18 -19 -20 -21 -22 -23 -24 -25 -26 -o run

creating run_gcmc.cmd.

Written by Gregory A. Ross, 2015.