JISCMail - CCP4BB Archives

The only way to really do "creative" stats on the PDB is to just download the whole thing. It is a sobering thought to realize just how tiny it is! Less than 17 GB. Once you've got it all on your hard disk you can start writing little programs to look for different things. I have posted mine for doing the "method count" here:
http://bl831.als.lbl.gov/~jamesh/pickup/pdb_method_count_notes.txt

For those of you who don't speak awk, basically what I do is first look at the "method used to determine the structure" entry, but when that is NULL or otherwise uninformative you can do a number of things. If the entry lists another PDB entry in "methods", it is probably a molecular replacement solution. Also, if the "software" used to solve it was PHASER or MOLREP, or ARP/wARP, or COMO etc., then I'm willing to bet it is an MR solution too. On the other hand, if the "software" was SOLVE or AUTOSHARP, then I imagine they probably were doing MAD/SAD. Even if they didn't know it.
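
A minimal Python sketch of the heuristic described above, for readers who would rather not read awk. It is not the posted script. The function names, the alias table, the "MAD/SAD" bucket label, the substring matching of program names, and the PDB-ID pattern are all assumptions made for illustration; only the program lists and the order of the checks come from the description above.

```python
import re
from collections import Counter

# Program names taken from the post; matching them as substrings is an assumption.
MR_SOFTWARE = ("PHASER", "MOLREP", "ARP/WARP", "COMO")
EXPERIMENTAL_SOFTWARE = ("SOLVE", "AUTOSHARP")

# Method values treated as uninformative (assumed set).
UNINFORMATIVE = {"", "NULL", "N/A", "OTHER"}

# Hypothetical alias table mapping long names onto the short labels in the tally.
ALIASES = {"MOLECULAR REPLACEMENT": "MR", "AB INITIO": "AI"}

# Assumed shape of a PDB ID: one digit followed by three alphanumerics.
PDB_ID = re.compile(r"\b\d[A-Z0-9]{3}\b")


def guess_method(method, software=""):
    """Guess a phasing method from the 'method' and 'software' strings."""
    method = " ".join((method or "").upper().split())
    software = (software or "").upper()

    # 1. Trust the stated method when it is informative.
    if method in ALIASES:
        return ALIASES[method]
    if method not in UNINFORMATIVE and not PDB_ID.search(method):
        return method

    # 2. Another PDB entry named in the method field suggests MR.
    if PDB_ID.search(method):
        return "MR"

    # 3. Fall back on the software used to solve the structure.
    if any(name in software for name in MR_SOFTWARE):
        return "MR"
    if any(name in software for name in EXPERIMENTAL_SOFTWARE):
        return "MAD/SAD"

    # 4. Nothing to go on: keep whatever was there.
    return method or "NULL"


def tally(entries):
    """Count guessed methods for (method, software) pairs, smallest count first."""
    counts = Counter(guess_method(m, s) for m, s in entries)
    return sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))
```

Calling `tally()` on one `(method, software)` pair per entry gives a list ordered like the output of `sort | uniq -c | sort -n`. Extracting those two strings from each file is left out here, since it depends on how the entries are stored.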

Currently, the script produces this list:
    1 COCRYSTAL
    1 FIBER-DIFFRACTION
    1 MOLPREP
    1 N?
    1 P
    1 SEE CITATION
    1 UNCONVENTIANAL
    1 UNNECESSARY
    2 UNCONVENTIONAL
    7 RIP
   54 N/A
   92 SIR
  260 AI
  355 MIRAS
  434 SIRAS
  689 OTHER
 1058 MIR
 5012 MAD
 5335 SAD
 8021 NULL
58293 MR

Clearly, there are a few new ones I need to "clean up", but some of them are funny. What is the method of "COCRYSTAL" anyway? Co-crystallized with a ligand? Or a heavy atom? There are plenty of entries that I can't figure out automatically (NULL + OTHER), but I imagine most of them are MR.

But yes, Tim and Thierry are right, there is definitely confusion and disagreement about what exactly constitutes "molecular replacement". I have been somewhat draconian here and lumped "MR" and "direct refinement" together because frankly it is just too hard to tell them apart from the PDB alone. In fact, there are 260 PDB entries that claim they were solved by "AB INITIO" methods, but did they really use direct methods with no prior phase information at all? Or did they do MAD/SAD and think that meant "ab initio"? Probably there are examples of all of these. The resolution of the structure might be helpful in this case, but sometimes even reading the paper doesn't help.
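
One way the resolution check might look, as a rough Python sketch. The function name and the 1.2 A cutoff are assumptions: the cutoff reflects the common rule of thumb that direct methods need roughly atomic-resolution data, and it is not a value taken from the post. A failed check only flags an entry for a closer look; it does not say what method was actually used.

```python
# Assumed cutoff (in Angstrom): rule of thumb for direct methods, not from the post.
DIRECT_METHODS_CUTOFF = 1.2


def plausible_ab_initio(method, resolution, cutoff=DIRECT_METHODS_CUTOFF):
    """Check an ab initio claim against the reported resolution.

    Returns True or False for entries labelled ab initio, and None when the
    entry is not labelled ab initio or the resolution is unknown.
    """
    if method.strip().upper() not in ("AI", "AB INITIO"):
        return None
    if resolution is None:
        return None
    # Smaller numbers mean higher resolution.
    return resolution <= cutoff
```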

I do agree though that the use of "molecular replacement" should be true to the term as Rossmann coined it, where the 6D search of a model against the new dataset was actually required to "solve" the structure. Lots of people run molrep (or molprep!?) for each and every dataset, even when they are doing ligand soaks. Never really have understood why.

However, I'm sure the day is not far off when phenix.refine or the like will check if the starting R factor is too high and just "automatically" invoke a run of MR to see if something clicks. Maybe even trying alternative space groups, just in case you screwed that up too. There may also soon be "automatic" runs of BALBES based on the sequence information of the input file. Eventually, the difference between different "methods" gets muddied. Then your "average" depositor will be even more unsure about what to put into REMARK 200, and people like me will be endlessly asking for command-line options that turn these "features" off.

-James Holton
MAD Scientist

On 4/15/2013 6:48 AM, Raji Edayathumangalam wrote: