Hi Cale,

> For any given structure in the PDB, I want to identify all the Histidine ND1
> atoms.  I then want to consider these atoms in pairs, measure the distance
> in Angstroms between the ND1 atoms in each pair, and compile these distances
> (along with residue numbers of the pair) in a table.  I then want to repeat
> this procedure for each unique structure in the PDB and generate a table
> containing all occurrences of HisND1 pairs with their corresponding
> separation distance.  Amongst other things, I want e.g. generate a histogram
> from this table and determine e.g. the shortest HisND1 pair distance
> observed and the structure in which this happens.  Does anyone have any
> suggestions for any tools I might be able to use to perform this search?

This looked like a fun little PyMOL task.  So, I wrote a script to do
it.  This will iterate over all the PDBs in a given directory and
calculate the non-redundant distances of all HIS ND1s.  A table is
written to pre_hist.csv.  Use R or something similar to create the
histogram and do the numerical analysis.

To use this, first, you'll need a prepared copy of the PDB, suitable
for this task, downloaded locally in pdb_path.  You will need to set
"pdb_path" in the script below to wherever your PDBs are.  (The other
kind folks on this list already pointed you to resources for mirroring
the PDB, if you haven't done it already.)  To run the script below
just save it to disk as "" and launch it with:

# from Linux
pymol -cq

# or, from Mac
/Applications/ -cq

If you're a Linux/Mac user, when the script finishes, just type:

sort -n -k8  pre_hist.csv | head -2

into your BASH shell to get the shortest distance.  For my example PDB it was:

$ sort -n -k8  pre_hist.csv | head -2
5rla   B        187     3729      C        312     7075      2.838208

To view the shortest distance use the ATOM-A and ATOM-B fields from
the output file.  From my example above I would need to fetch the
protein and create a distance from "index 3729" to "index 7075".
Here's how:

# fetch the protein
fetch 5rla, async=0

# show the distance
distance index 3729, index 7075

# zoom on the new distance
zoom dist01

Here's the Python script that does the work:

import glob, os, pymol, sys
from pymol import cmd

files = glob.glob(the_pdb+os.sep+"*.pdb")

if not len(files):
    print "Please set 'the_pdb' variable to a valid path containing PDB files."
    print "Processing %d files." % len(files)

s, outFile = "resn HIS and name ND1", "pre_hist.csv"

f = open(outFile, 'wb')
# write the header
# for each file in the mirror
for x in files:
    n = cmd.get_names()[0]
    m = cmd.get_model(s).atom
    # pairwise for each atom
    for aa in m:
        for bb in m:
            # avoid duplicates
            if aa.index==bb.index: continue
            f.write( "%s\t%s\t%s\t%s\t%s\t%s\t%d\t%f\n" %
                     (n, aa.chain, aa.resi, aa.index,
                         bb.chain, bb.resi, bb.index,
                         cmd.get_distance( "index %s" % aa.index,
"index %s" % bb.index)))

print "Processed %d files.  Please see %s for results." % (len(files), outFile)


-- Jason

Jason Vertrees, PhD
PyMOL Product Manager
Schrodinger, LLC

(e) [log in to unmask]
(o) +1 (603) 374-7120