JISCMail - CCP4BB Archives

I have now looked at James's two challenges to see what I could learn
from them, and will try to give enough details so that less experienced
readers of this list can repeat what I did and apply the experience
thereby gained to solving their own structures. For those who are not
interested in the details, the bottom line is that SHELXC/D/E can solve
both 'possible' and 'impossible' almost routinely, starting by finding
the substructure, without using any information derived from the known
structure. It should be emphasised that this does not produce a fully
refined structure, but the resulting poly-Ala trace of about 70% of the
structure and 'free lunch' maps showing many side-chains would be a good
starting point for programs (such as Buccaneer or wARP) that dock a
known sequence and complete the structure. My students would of course
be expected to complete the map interpretation themselves using the
excellent facilities available in Coot; that is always very educational!

I used the current SHELX beta-test programs that will shortly be
released as the official versions.

First I used Tim Gruene's mtz2sca to convert James's mtz files into a
format that SHELX can read, and then ran SHELXC from the command line to
make the files possible.hkl (native intensity data), possible_fa.hkl (h
k l FA and phase shift alpha) and possible_fa.ins (input file to run
SHELXD), and the same for 'impossible'. Alternatively I could have used
Thomas Schneider's hkl2map GUI to call SHELXC/D/E. I looked at the
<d"/sig> row to see where to cut the resolution for finding the heavy
atoms and decided on 3.5A (SHEL 999 3.5). If I had been able to input
unmerged data to SHELXC, e.g. as XDS_ASCII.HKL which is always unmerged,
I would also have obtained a CC1/2 value that would also indicate where
to cut the resolution. 3.5A corresponded to <d"/sig> of about 1.0 which
is still rather low, but cutting at even lower resolution tends to give
less accurate substructures. To compensate for this optimistic choice
for the rather weak anomalous data, I increased the number of trials
(NTRY) to 10000. These are the two most critical parameters for SHELXD,
and as it turns out, for the whole structure solution.
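
For readers who want to script the resolution choice described above, here
is a minimal Python sketch. It assumes the <d"/sig> values per resolution
shell have been copied by hand from the SHELXC output. The function name
anomalous_cutoff and the shell values are invented for illustration (they
are not taken from James's data), and the threshold of 1.0 is simply the
value quoted above, not a hard rule.

```python
def anomalous_cutoff(shells, threshold=1.0):
    """Return the highest resolution (smallest d, in Angstrom) reached
    before <d"/sig> first drops below `threshold`.

    `shells` is a list of (d_min, d_over_sig) pairs ordered from low to
    high resolution, in the same order as the SHELXC table.
    Returns None if even the first shell is below the threshold.
    """
    cutoff = None
    for d_min, d_over_sig in shells:
        if d_over_sig < threshold:
            break
        cutoff = d_min
    return cutoff


# Hypothetical <d"/sig> values, NOT taken from the challenge data.
example_shells = [
    (8.0, 2.1), (6.0, 1.8), (5.0, 1.5), (4.2, 1.3),
    (3.8, 1.1), (3.5, 1.0), (3.2, 0.9), (3.0, 0.8),
]

cutoff = anomalous_cutoff(example_shells)
# Print the corresponding SHEL instruction for the SHELXD .ins file.
print(f"SHEL 999 {cutoff}")
```

With these made-up numbers the sketch prints "SHEL 999 3.5". It is only a
first guess; the number of trials (NTRY) still has to be chosen by hand.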

However before running the multi-CPU version of SHELXD, since the PDB
file of the refined structure was available, I ran AnoDe to use the PDB
file and anomalous data in possible_fa.hkl to check the substructure.
This told me that for both 'possible' and 'impossible' it should be
possible to find 12 well-defined sites, and also that the original
impossible.mtz was inconsistently indexed. AnoDe also outputs a list of
heavy atoms in SHELX format that can be input directly into SHELXE for
density modification and tracing. However that would be cheating because
AnoDe reads the final PDB file to calculate the anomalous density, and I
was trying to solve the structure without assuming the answer, even
indirectly. In general a substructure calculated in this way by AnoDe is
always much more accurate and complete than one found ab initio from the
anomalous data.

The best SHELXD solutions had CC 34.6 and CCweak 15.0 for 'possible' and
28.4/13.2 for 'impossible'. I always tell people to aim for at least
30/15, so maybe I should have done more than 10000 tries for
'impossible' but my wife was getting impatient (I had promised her that
we could go for a walk in the snow) so I accepted it. I looked at the
peaklist from SHELXD pretending not to know that there should be 12
sites. There was a bit of a gap in peak height 0.53/0.42 between peaks 11
and 12 for 'possible' and 0.53/0.45 between peaks 10 and 11 for
'impossible', so for SHELXE I used -h11 and -h10 respectively. However I
also used the new -z option that refines the substructure before
starting on the phasing; as it turned out, that increased the number
of heavy atoms to 12 in both cases, and all 12 were correct.
I started shelxe with:

shelxe possible possible_fa -s0.55 -a30 -h11 -z -q -e1

and similarly for 'impossible'. I was expecting problems so I did 30
cycles of autotracing; normally 3 would be enough. I just guessed the
solvent content (-s0.55); maybe that could be fine-tuned. For SHELXE,
there is a remarkably consistent rule that if the CC for the trace
against the native data gets above 25%, the structure is solved. For
'possible' this happened after 25 tracing cycles, and the final 'free
lunch' map (-e1) was indeed convincing. However 'impossible' only
reached a CC of 17% and although the map did not look completely wrong,
I would not have been able to interpret it. So I changed one default
parameter (-m30), increasing the number of density modification cycles
to compensate for the poor starting phases, and ran the job again. CC
reached 25% after 16 cycles and produced an excellent map and trace.
Almost certainly, 'possible' would also benefit from the change, but it
was solved anyway. As Tom has already pointed out, sometimes a small
change can cause the tracing to take a different path and make the
difference between success and failure.
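
The site selection and the SHELXE call described above can be sketched in
Python as follows. This is only an illustration under stated assumptions:
the helper names (sites_from_gap, shelxe_command, looks_solved) are
hypothetical, the peak-height list is invented apart from the 0.53/0.42
gap quoted above, and the only SHELXE options used are the ones that
appear in this message.

```python
def sites_from_gap(heights):
    """Number of peaks to keep: those before the largest relative drop
    between consecutive peak heights (list sorted in descending order)."""
    drops = [(heights[i] - heights[i + 1]) / heights[i]
             for i in range(len(heights) - 1)]
    return drops.index(max(drops)) + 1


def shelxe_command(name, n_sites, solvent=0.55, trace_cycles=30,
                   dm_cycles=None):
    """Assemble the SHELXE command line from the options quoted above."""
    args = ["shelxe", name, f"{name}_fa", f"-s{solvent}",
            f"-a{trace_cycles}", f"-h{n_sites}", "-z", "-q", "-e1"]
    if dm_cycles is not None:
        # -m sets the number of density modification cycles.
        args.append(f"-m{dm_cycles}")
    return " ".join(args)


def looks_solved(trace_cc_percent, threshold=25.0):
    """Rule of thumb: trace CC against native data above 25% means solved."""
    return trace_cc_percent > threshold


# Hypothetical SHELXD peak heights; only the 0.53 -> 0.42 step is from
# the message above.
peaks = [1.00, 0.93, 0.88, 0.81, 0.76, 0.71, 0.67,
         0.63, 0.59, 0.56, 0.53, 0.42, 0.40, 0.38]

n = sites_from_gap(peaks)
print(shelxe_command("possible", n))
print(shelxe_command("impossible", 10, dm_cycles=30))
print(looks_solved(17.0))
```

The first printed line reproduces the command given above for 'possible';
the second adds -m30 as in the rerun for 'impossible'. A CC of 17% fails
the 25% test, which matches the first 'impossible' attempt.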

George



> Woops!  sorry folks.  I made a mistake with the I(+)/I(-) entry.  They
> had the wrong axis convention relative to 3dko and the F in the same
> file.  Sorry about that.
>
> The files on the website now should be right.
> http://bl831.als.lbl.gov/~jamesh/challenge/possible.mtz
> http://bl831.als.lbl.gov/~jamesh/challenge/impossible.mtz
>
> md5 sums:
> c4bdb32a08c884884229e8080228d166  impossible.mtz
> caf05437132841b595be1c0dc1151123  possible.mtz
>
> -James Holton
> MAD Scientist
>
> On 1/12/2013 8:25 AM, James Holton wrote:
>>
>> Fair enough!
>>
>> I have just now added DANO  and I(+)/I(-) to the files.  I'll be very
>> interested to see what you can come up with!  For the record, the
>> phases therein came from running mlphare with default parameters but
>> exactly the correct heavy-atom constellation (all the sulfur atoms in
>> 3dko), and then running dm with default parameters.
>>
>> Yes, there are other ways to run mlphare and dm that give better
>> phases, but I was only able to determine those parameters by
>> "cheating" (comparing the resulting map to the right answer), so I
>> don't think it is "fair" to use those maps.
>>
>> I have had a few questions about what is "cheating" and what is not
>> cheating.  I don't have a problem with the use of sequence
>> information because that actually is something that you realistically
>> would know about your protein when you sat down to collect data.  The
>> sequence of this molecule is that of 3dko:
>> http://bl831.als.lbl.gov/~jamesh/challenge/seq.pir
>>
>>   I also don't have a problem with anyone actually using an
>> automation program to _help_ them solve the "impossible" dataset as
>> long as they can explain what they did.  Simply putting the above
>> sequence into BALBES would, of course, be cheating!  I suppose one
>> could try eliminating 3dko and its "homologs" from the BALBES search,
>> but that, in and of itself, is perhaps relevant to the challenge:
>> "what is the most distant homolog that still allows you to solve the
>> structure?".  That, I think, is also a stringent test of
>> model-building skill.
>>
>>   I have already tried ARP/wARP, phenix.autobuild and
>> buccaneer/refmac.  With default parameters, all of these programs
>> fail on both the "possible" and "impossible" datasets.  It was only
>> with some substantial tweaking that I found a way to get
>> phenix.autobuild to crack the "possible" dataset (using 20 models in
>> parallel).  I have not yet found a way to get any automation program
>> to build its way out of the "impossible" dataset.   Personally, I
>> think that the breakthrough might be something like what Tom
>> Terwilliger mentioned.  If you build a good enough starting set of
>> atoms, then I think an automation program should be able to take you
>> the rest of the way.  If that is the case, then it means people like
>> Tom who develop such programs for us might be able to use that
>> insight to improve the software, and that is something that will
>> benefit all of us.
>>
>> Or, it is entirely possible that I'm just not running the current
>> software properly!  If so, I'd love it if someone who knows better
>> (such as their developers) could enlighten me.
>>
>> -James Holton
>> MAD Scientist
>>
>> On 1/12/2013 3:07 AM, Pavol Skubak wrote:
>>>
>>> Dear James,
>>>
>>> your challenge in its current form ignores an important source
>>> of information for model building that is available for your
>>> simulated data - namely, it does not allow the use of anomalous
>>> phase information in the model building. In difficult cases on
>>> the edge of success such as this one, this typically makes
>>> the difference between building and not building.
>>>
>>> If you can make the F+/F- and Se substructure available, we
>>> can test whether this is the case indeed. However, while I
>>> expect this would push the challenge further significantly,
>>> most likely you would be able to decrease the Se incorporation
>>> of your simulated data further to such levels that the anomalous
>>> signal is again no longer sufficient to build the structure. And
>>> most likely, there would again exist an edge where a small
>>> decrease in the Se incorporation would lead from a model built
>>> to no model built.
>>>
>>> Best regards,
>>>
>>> --
>>> Pavol Skubak
>>> Biophysical Structural Chemistry
>>> Gorleaus Laboratories
>>> Einsteinweg 55
>>> Leiden University
>>> LEIDEN  2333CC
>>> the Netherlands
>>> tel: 0031715274414
>>> web: http://bsc.lic.leidenuniv.nl/people/skubak-0
>>
>


--
Prof. George M. Sheldrick FRS
Dept. Structural Chemistry,
University of Goettingen,
Tammannstr. 4,
D37077 Goettingen, Germany
Tel. +49-551-39-3021 or -3068
Fax. +49-551-39-22582