The only way to really do "creative" stats on the PDB is to just download the whole thing.  It is a sobering thought to realize just how tiny it is!  Less than 17 GB.  Once you've got it all on your hard disk you can start writing little programs to look for different things.  I have posted mine for doing the "method count" here:
http://bl831.als.lbl.gov/~jamesh/pickup/pdb_method_count_notes.txt

For those of you who don't speak awk, basically what I do is first look at the "method used to determine the structure" entry, but when that is NULL or otherwise uninformative you can do a number of things.  If the entry lists another PDB entry in "methods", it is probably a molecular replacement solution.  Also, if the "software" used to solve it was PHASER or MOLREP, or ARP/wARP, or COMO etc., then I'm willing to bet it is an MR solution too.  On the other hand, if the "software" was SOLVE or AUTOSHARP, then I imagine they probably were doing MAD/SAD.  Even if they didn't know it.
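For those who would rather read Python than awk, the fallback logic described above can be sketched roughly as follows.  This is not the posted script, just an illustration of the same idea.  The REMARK 200 label strings, the "PDB ENTRY" test, and the function names read_remark_200 and guess_method are assumptions made for this example, and the program lists contain only the names mentioned above.

```python
import re

# Program names taken from the text above; illustrative, not exhaustive.
MR_SOFTWARE = {"PHASER", "MOLREP", "ARP/WARP", "COMO"}
MAD_SAD_SOFTWARE = {"SOLVE", "AUTOSHARP"}
UNINFORMATIVE = {"", "NULL", "N/A", "OTHER"}

# Assumed layout of the relevant REMARK 200 lines; check against real files.
FIELD_RE = re.compile(
    r"^REMARK 200\s+(METHOD USED TO DETERMINE THE STRUCTURE|SOFTWARE USED)"
    r"\s*:\s*(.*)$"
)


def read_remark_200(lines):
    """Collect the method and software fields from the lines of one entry."""
    fields = {}
    for line in lines:
        match = FIELD_RE.match(line.rstrip())
        if match:
            fields[match.group(1)] = match.group(2).strip().upper()
    return fields


def guess_method(fields):
    """Guess the phasing method, falling back on the software used."""
    method = fields.get("METHOD USED TO DETERMINE THE STRUCTURE", "")
    software = fields.get("SOFTWARE USED", "")

    # Another PDB entry named as the "method" is taken to mean MR.
    if "PDB ENTRY" in method:
        return "MR"
    # An informative method entry is trusted as written.
    if method not in UNINFORMATIVE:
        return method

    # Otherwise bet on the software.  Checking MR programs first is an
    # arbitrary choice made for this sketch.
    programs = set(re.split(r"[,;\s]+", software))
    if programs & MR_SOFTWARE:
        return "MR"
    if programs & MAD_SAD_SOFTWARE:
        return "MAD/SAD"
    return method or "NULL"


if __name__ == "__main__":
    example = [
        "REMARK 200 METHOD USED TO DETERMINE THE STRUCTURE: NULL",
        "REMARK 200 SOFTWARE USED: PHASER",
    ]
    print(guess_method(read_remark_200(example)))
```

Running guess_method over every file in a local copy of the PDB and tallying the return values would give a table of the same general shape as the one below, though not necessarily the same numbers.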

Currently, the script produces this list:
    1  COCRYSTAL
    1  FIBER-DIFFRACTION
    1  MOLPREP
    1  N?
    1  P
    1  SEE CITATION
    1  UNCONVENTIANAL
    1  UNNECESSARY
    2  UNCONVENTIONAL
    7  RIP
   54  N/A
   92  SIR
  260  AI
  355  MIRAS
  434  SIRAS
  689  OTHER
 1058  MIR
 5012  MAD
 5335  SAD
 8021  NULL
58293  MR

Clearly, there are a few new ones I need to "clean up", but some of them are funny.  What is the method of "COCRYSTAL" anyway?  Co-crystallized with a ligand?  Or a heavy atom?  There are plenty of entries that I can't figure out automatically (NULL + OTHER), but I imagine most of them are MR.
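To turn the tallies above into the percentage that was asked for, a few lines of Python are enough.  This is only a sketch: the counts are copied from the list above, and the "upper bound" simply assumes that every NULL and OTHER entry is really MR, which is a guess rather than something the data can confirm.

```python
# Counts copied from the script output listed above.
counts = {
    "COCRYSTAL": 1, "FIBER-DIFFRACTION": 1, "MOLPREP": 1, "N?": 1, "P": 1,
    "SEE CITATION": 1, "UNCONVENTIANAL": 1, "UNNECESSARY": 1,
    "UNCONVENTIONAL": 2, "RIP": 7, "N/A": 54, "SIR": 92, "AI": 260,
    "MIRAS": 355, "SIRAS": 434, "OTHER": 689, "MIR": 1058, "MAD": 5012,
    "SAD": 5335, "NULL": 8021, "MR": 58293,
}

total = sum(counts.values())

# Entries that could not be classified automatically.
unresolved = counts["NULL"] + counts["OTHER"]

# Share of entries classified as MR outright.
mr_share = counts["MR"] / total

# Upper bound under the assumption that all unresolved entries are MR.
mr_upper_bound = (counts["MR"] + unresolved) / total

print(f"entries tallied:        {total}")
print(f"unresolved (NULL+OTHER): {unresolved}")
print(f"MR share:               {mr_share:.1%}")
print(f"MR share, upper bound:  {mr_upper_bound:.1%}")
```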

But yes, Tim and Thierry are right, there is definitely confusion and disagreement about what exactly constitutes "molecular replacement".  I have been somewhat draconian here and lumped "MR" and "direct refinement" together because frankly it is just too hard to tell them apart from the PDB alone.  In fact, there are 260 PDB entries (the "AI" row above) that claim they were solved by "AB INITIO" methods, but did they really use direct methods with no prior phase information at all?  Or did they do MAD/SAD and think that meant "ab initio"?  There are probably examples of each.  The resolution of the structure might be helpful in this case, but sometimes even reading the paper doesn't help.

I do agree though that the use of "molecular replacement" should be true to the term as Rossmann coined it, where the 6D search of a model against the new dataset was actually required to "solve" the structure.  Lots of people run molrep (or molprep!?) for each and every dataset, even when they are doing ligand soaks.  Never really have understood why. 

However, I'm sure the day is not far off when phenix.refine or the like will check if the starting R factor is too high and just "automatically" invoke a run of MR to see if something clicks.  Maybe even trying alternative space groups, just in case you screwed that up too.  There may also soon be "automatic" runs of BALBES based on the sequence information of the input file.  Eventually, the difference between different "methods" gets muddied.  Then your "average" depositor will be even more unsure about what to put into REMARK 200, and people like me will be endlessly asking for command-line options that turn these "features" off.

-James Holton
MAD Scientist

On 4/15/2013 6:48 AM, Raji Edayathumangalam wrote:
Hi Folks,

Does anyone know of an accurate way to mine the PDB for what percent of the total X-ray structures deposited to date were done using molecular replacement? I got hold of a pie chart of this for 2006 from a Google search, but I'd like to get hold of the most current statistics, if possible. The PDB has all kinds of statistics, but not one with the numbers or percent of X-ray structures deposited sorted by phasing type or X-ray structure determination method.

For example, an "Advanced Search" on the PDB site pulls up the following:

Total current structures by X-ray: 78960
48666 by MR
5139 by MAD
5672 by SAD
1172 by MIR
94 by MIR (when the word is completely spelled out)
75 by SIR
5 by SIR (when the word is completely spelled out)

That leaves about 18,000 X-ray structures either solved by other phasing methods (seems unlikely) or somehow unaccounted for in the way I am searching. Maybe the way I am doing the searches is no good. Does someone have a better way to do this?
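The remainder can be checked by summing the search results listed above.  A minimal sketch, assuming the individual searches do not overlap (an entry matching two search terms would be counted twice); the dictionary keys are just labels chosen for this example.

```python
# Counts copied from the "Advanced Search" results listed above.
xray_total = 78960
by_search = {
    "MR": 48666,
    "MAD": 5139,
    "SAD": 5672,
    "MIR": 1172,
    "MIR (spelled out)": 94,
    "SIR": 75,
    "SIR (spelled out)": 5,
}

# Assumes no entry is returned by more than one search.
accounted = sum(by_search.values())
unaccounted = xray_total - accounted

print(f"accounted for by the searches: {accounted}")
print(f"left over:                     {unaccounted}")
print(f"MR as a fraction of all X-ray: {by_search['MR'] / xray_total:.1%}")
```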

Thanks much.
Raji

--
Raji Edayathumangalam
Instructor in Neurology, Harvard Medical School
Research Associate, Brigham and Women's Hospital
Visiting Research Scholar, Brandeis University