Actually, they are not quite the same.

Members of an ensemble spread out over a supercell are equivalent to a conventional multi-conformer model, but only when it comes to the density derived from the coordinate atoms alone.  They are not the same when it comes to bulk solvent, nor when it comes to clashes at crystal packing interfaces.

It is actually a tricky question: if you have a two-conformer salt bridge between symmetry mates, which one should be "A" and which "B"? In Phenix, "A" will clash with "A" in the symmetry mate. Refmac, as I understand it, checks whether the occupancies of any two atoms, whatever their conformer letters, sum to >= 1.0; if so, they can clash.  Helen tells me Vagabond's treatment of such contacts is under development. A consensus has not been reached.

However, in a supercell refinement you don't have to ask this question. Every atom has an occupancy of 1.00 and there are no conformer letters (other than the default " "). Molecules then interact with their neighbors by the same rules as anywhere else in space.  And yes, each one would get its own, personal, solvent envelope.  Neat, huh?

If you want to try supercell refinement, do this:
1) expand your mtz data to P1.  Use the CCP4 "cad" program with the keyword "outlim space 1". Do not change the space group.
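For example, a minimal cad run for this step might look like the following (file names here are placeholders, not from anything above):

   cad hklin1 mydata.mtz hklout mydata_P1.mtz <<EOF
   labin file 1 all
   outlim space 1
   EOF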
2) in a second run of "cad", change the space group to P1 and multiply all your Fs by the number of full unit cells you want. Let's say that is 8, i.e.:
 "scale file 1 8.0 0"
3) re-index this P1 mtz file with the "reindex" program, using keyword "reindex 2h,2k,2l". You will get a new mtz that has 2x the cell length along each edge, but is only 12.5% complete. Don't worry, that's going to be OK.
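For the 8-cell example, that run might look like this (file names are placeholders):

   reindex hklin mydata_P1_x8.mtz hklout supercell.mtz <<EOF
   reindex 2h,2k,2l
   EOF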
4) Now go to the *.pdb file: edit the "CRYST1" line to have the same cell as the new mtz, and change the space group to P1.
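You can edit CRYST1 by hand, or have pdbset write it for you. A sketch, where the cell numbers are made up and must be replaced with twice your own cell edges:

   pdbset xyzin mymodel.pdb xyzout mymodel_super.pdb <<EOF
   cell 120.0 160.0 200.0 90.0 90.0 90.0
   spacegroup P1
   EOF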

5) make copies of your coordinates and shift them to fill this new P1 supercell. Note that this is not only populating all symmetry mates in one unit cell (total of 4 in P212121), but also making translated copies in the seven other unit cells. This would be 4*8 or 32 copies in all. How you name them depends on your personal preferences, but a simple approach is to give each symmetry mate its own chain ID (upper and lower case are allowed).  The "symgen P212121" keyword is convenient, but does not make any effort to pack things into a box. You need to do that yourself by applying various "shift 1 0 -1" like operations in pdbset.  This might be more intuitive to do in coot.  Also, while you are doing this you need to decide which part of your multi-unit-cell space is going to correspond to your "conformer A", and which other part is going to be "B", and then "C", etc.  You won't be doing occupancy refinement here. Rather, by adjusting how many "A"s and "B"s you use to make up the ensemble you effectively now have 32 levels of occupancy (instead of the usual 100).
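A skeleton of those pdbset calls, following the keyword usage above (the chain re-labeling and packing choices are left to you, and it is worth checking the pdbset documentation for whether "shift" expects fractional or orthogonal units):

   # generate the symmetry mates of one unit cell
   pdbset xyzin asu.pdb xyzout cell_000.pdb <<EOF
   symgen P212121
   EOF
   # translate a copy by one unit cell along a; repeat with
   # "shift 0 1 0", "shift 1 0 -1", etc. for the other seven cells
   pdbset xyzin cell_000.pdb xyzout cell_100.pdb <<EOF
   shift 1 0 0
   EOF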

Important point: If you find this whole process daunting and you keep running out of letters in the alphabet, now you understand why the authors of refinement programs have not made this automatic. There are a VERY large number of ways to pack these slightly-different copies of your ASU into a supercell, and only one is "best". Yes, as Vaheh said, all the combinations will be equivalent when it comes to the density derived from coordinates, but the solvent density will be allowed to conform to each local environment, and your clash scores will vary.  Which combination is "correct"?  I think that would be the one with the fewest clashes.  It is like a big combination lock.  But, for a first try, you might want to just use the "symgen" and "shift 1 0 0" commands, re-label all the chain IDs, then put the same conformer everywhere and let non-bond repulsions do their job in the subsequent refinement.

6) give your new re-indexed mtz file and expanded-to-P1-supercell pdb file to your favorite refinement program.
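For example, with refmac5 (the column labels are placeholders for your own):

   refmac5 hklin supercell.mtz xyzin mymodel_super.pdb \
           hklout refined.mtz xyzout refined.pdb <<EOF
   labin FP=F SIGFP=SIGF FREE=FreeR_flag
   ncyc 20
   end
   EOF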

Now for the amazing part: this actually works.  At least refmac5, phenix.refine and shelxl (in my hands) have absolutely no problem refining this highly "redundant" model against data that are only 12.5% complete.  They don't complain at all!  Your R factors and geometry scores will be stable, but hopefully at least a little better than if you used a conventional, single-ASU ensemble.  You might think it would be a good idea to fill all the "missing" HKLs with zeroes.  Do not do this.  Phenix.refine hates it and refmac5 only barely tolerates it.  You might also think that with so many "free parameters" your Rwork would drop to zero right away (leaving Rfree far behind), but in practice that does not happen.

Very important: DO NOT LOOK AT THE MAPS that come out of this kind of refinement. Not directly. They will look all weird and distorted. You need to average over all the ASUs to recover interpretable 2Fo-Fc and Fo-Fc maps. The strictly correct way to do this is to cut out each ASU, re-orient and align these maps, and then average them all together. Fortunately, there is another way to do this that is much easier and faster:

7) Re-index the refinement output mtz file with "reindex h/2,k/2,l/2", and in the same run of "reindex" change your space group back to what it was in your normal refinement.  Do NOT run this mtz file through cad.  Cad will throw out all but one ASU. In fact, don't open this re-indexed mtz file with any program other than coot or "fft". The reason is that this mtz file still has P1 data: it is "overcomplete". Just as when cad performed the "outlim space 1" expansion, there is more than one ASU's worth of data in the mtz.  Perhaps it is a quirk in the fft algorithm, but this all-ASU averaging happens automatically if the input reflection data are "overcomplete".
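For the P212121 example used above, that reindex run might look like this:

   reindex hklin refined.mtz hklout refined_back.mtz <<EOF
   reindex h/2,k/2,l/2
   symmetry P212121
   EOF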

8) finally, you might want to write a script for reversing the procedure for populating your supercell so that you can align all the members into a single ASU and look at them.
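There is no canonical script for this; a hedged skeleton, assuming you have already split the refined model into one file per copy, and remembering that the shifts must be the inverses of whatever you applied in step 5:

   # undo the "shift 1 0 0" translation this copy received in step 5
   pdbset xyzin copy_100.pdb xyzout copy_100_back.pdb <<EOF
   shift -1 0 0
   EOF
   # symmetry mates also need the inverse symmetry operation (e.g. via
   # pdbset's "transform" keyword) before they will overlay the reference copy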


But wait!? Isn't this "over-fitting"?  No, it is not.  Over-fitting in general is when you drive your residual (aka Rwork) essentially to zero.  That doesn't happen here because the geometry term holds you back.  You might think that with enough copies in the ensemble it should be possible to fit any density with geometrically reasonable molecules, but that is not what happens in practice.  This is actually quite remarkable!  So many "free parameters" and yet the tug-o-war between R factors and geometry remains.  Now, of course, the real molecules in the real crystal are simultaneously obeying the laws of chemistry and generating the diffraction patterns that we see, but I have never found a way to make a macromolecular model that does both.  Neither has anybody else, of course, but wouldn't it be cool if someone figured out how?

-James Holton
MAD Scientist

On 3/18/2023 6:35 PM, Oganesyan, Vaheh wrote:

Hi Ben,

All copies created by multiplying cell dimensions will act exactly the same as the original one, mathematically exactly. Nick's approach is better. Something similar to what Nick said was published around 2002-2003; I was reviewing it. I did not understand then what the author was trying to achieve and kept thinking about it for a few months. The author split the model into 20 copies, each with 5% occupancy. After refinement he got an ensemble that looked like NMR structures. I'm not sure, however, that adding that uncertainty will help answer any question.

Vaheh

From: CCP4 bulletin board <[log in to unmask]> On Behalf Of benjamin bax
Sent: Saturday, March 18, 2023 5:07 PM
To: [log in to unmask]
Subject: Re: [ccp4bb] To Trim or Not to Trim

Hi,
Probably a stupid question.
Could you multiply the a, b and c cell dimensions by 2 or 3 (to give 8 or 27 structures) and restrain well-defined parts of the structure to be ‘identical’, to give a more NMR-like, chemically sensible ensemble of structures?
Ben


> On 18 Mar 2023, at 12:04, Helen Ginn <[log in to unmask]> wrote:
>
> Models for crystallography have two purposes: refinement and interpretation. Here these two purposes are in conflict. Neither is served well by the trim or the no-trim scenario: trimming results in a deficit for refinement, and not trimming results in a deficit for interpretation.
>
> Our computational tools are not “fixed” in the same way that the standard amino acids are “fixed” or your government’s bureaucracy pathways are “fixed”. They are open for debate and for adjustments. This is a fine example where it may be more productive to discuss the options for making changes to the model itself or its representation, to better account for awkward situations such as these. Otherwise we are left figuring out the best imperfect way to use an imperfect tool (as all tools are, to varying degrees!), which isn’t satisfying for enough people, enough of the time.
>
> I now appreciate the hypocrisy in the argument “do not trim, but also don’t model disordered regions”, even though I’d be keen to avoid trimming. This discussion has therefore softened my own viewpoint.
>
> My refinement models (as implemented in Vagabond) do away with the concept of B factors, precisely for the anguish it causes here, and instead refine a distribution of protein conformations which is sampled to generate an ensemble. By describing the conformations through the torsion angles that comprise the protein, modelling the flexibility of a disordered lysine is comparatively trivial, and indeed modelling all possible conformations of a disordered loop becomes feasible. Lysines end up looking like a frayed end of a rope. Each conformation can produce its own solvent mask, which can be summed together to produce a blurring of density that matches what you would expect to see in the crystal.
>
> In my experience this doesn’t drop the R factors as much as you’d assume, because blurred out protein density does look very much like solvent, but it vastly improves the interpretability of the model. This also better models the boundary between the atoms you would trim and those you’d leave untrimmed, by avoiding such a binary distinction. No fear of trimming and pushing those errors unseen into the rest of the structure. No fear of leaving atoms in with an inadequate B factor model that cannot capture the nature of the disorder.
>
> Vagabond is undergoing a heavy rewrite though, and is not yet ready for human consumption. Its first iteration worked on single-dataset-single-model refinement, which handled disordered side chains well enough, with no need to decide to exclude atoms. The heart of the issue lies in main chain flexibility, and this must be handled correctly, for reasons of interpretability and elucidating the biological impact. This model isn’t perfect either, and necessitates its own compromises - but will provide another tool in the structural biology arsenal.
>
> —-
>
> Dr Helen Ginn
> Group leader, DESY
> Hamburg Advanced Research Centre for Bioorganic Chemistry (HARBOR)
> Luruper Chaussee 149
> 22607 Hamburg
>
