The only way to really do "creative" stats on the PDB is to just
download the whole thing. It is a sobering thought to realize
just how tiny it is! Less than 17 GB. Once you've got it all on
your hard disk you can start writing little programs to look for
different things. I have posted mine for doing the "method count"
here:
http://bl831.als.lbl.gov/~jamesh/pickup/pdb_method_count_notes.txt
For those of you who don't speak awk, basically what I do is first
look at the "method used to determine the structure" entry, but
when that is NULL or otherwise uninformative you can do a number
of things. If the entry lists another PDB entry in "methods", it
is probably a molecular replacement solution. Also, if the
"software" used to solve it was PHASER or MOLREP, or ARP/wARP, or
COMO etc., then I'm willing to bet it is an MR solution too. On
the other hand, if the "software" was SOLVE or AUTOSHARP, then I
imagine they probably were doing MAD/SAD. Even if they didn't
know it.
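For those who would rather not read awk, the fallback logic
described above might be sketched in Python roughly as follows.
This is not the posted script: the function name guess_method, the
keyword lists, the set of "known" labels and the PDB-ID pattern are
all assumptions made for illustration, built only from the programs
named above, and the substring matching is deliberately crude.

```python
import re

# Hypothetical keyword lists, limited to the programs named in the text.
MR_SOFTWARE = ("PHASER", "MOLREP", "ARP/WARP", "COMO")
MAD_SAD_SOFTWARE = ("SOLVE", "AUTOSHARP")

# Labels treated as informative enough to accept as given (an assumption).
KNOWN_METHODS = {"MR", "MAD", "SAD", "MIR", "SIR", "MIRAS", "SIRAS"}

# Assumed shape of a PDB ID: a digit followed by three alphanumerics.
PDB_ID = re.compile(r"\b[0-9][A-Z0-9]{3}\b")


def guess_method(method, software):
    """Guess a phasing method from the 'method' and 'software' fields."""
    method = (method or "").strip().upper()
    software = (software or "").strip().upper()

    # 1. Trust the method field when it is a recognised label.
    if method in KNOWN_METHODS:
        return method
    # 2. Another PDB entry named under "methods" suggests MR.
    if PDB_ID.search(method):
        return "MR"
    # 3. Fall back on the software used to solve the structure.
    if any(name in software for name in MR_SOFTWARE):
        return "MR"
    if any(name in software for name in MAD_SAD_SOFTWARE):
        return "MAD/SAD"
    # 4. Nothing to go on: keep whatever was there, or call it NULL.
    return method or "NULL"


if __name__ == "__main__":
    print(guess_method("NULL", "PHASER"))
    print(guess_method("PDB ENTRY 1ABC", ""))
    print(guess_method(None, "autoSHARP"))
```

Checking the method field first means the software guess is only
ever used as a tie-breaker for entries that would otherwise land in
NULL or OTHER.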
Currently, the script produces this list:
1 COCRYSTAL
1 FIBER-DIFFRACTION
1 MOLPREP
1 N?
1 P
1 SEE CITATION
1 UNCONVENTIANAL
1 UNNECESSARY
2 UNCONVENTIONAL
7 RIP
54 N/A
92 SIR
260 AI
355 MIRAS
434 SIRAS
689 OTHER
1058 MIR
5012 MAD
5335 SAD
8021 NULL
58293 MR
Clearly, there are a few new ones I need to "clean up", but some of
them are funny. What is the method of "COCRYSTAL" anyway?
Co-crystallized with a ligand? Or a heavy atom? There are plenty
of entries that I can't figure out automatically (NULL + OTHER),
but I imagine most of them are MR.
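To put the list in proportion, the counts can be totalled and turned
into percentages. A minimal sketch, with the numbers copied straight
from the list above; the variable names are only for illustration.

```python
# Counts copied from the list produced by the script.
counts = {
    "COCRYSTAL": 1, "FIBER-DIFFRACTION": 1, "MOLPREP": 1, "N?": 1,
    "P": 1, "SEE CITATION": 1, "UNCONVENTIANAL": 1, "UNNECESSARY": 1,
    "UNCONVENTIONAL": 2, "RIP": 7, "N/A": 54, "SIR": 92, "AI": 260,
    "MIRAS": 355, "SIRAS": 434, "OTHER": 689, "MIR": 1058,
    "MAD": 5012, "SAD": 5335, "NULL": 8021, "MR": 58293,
}

total = sum(counts.values())

# Entries that could not be classified automatically.
unresolved = counts["NULL"] + counts["OTHER"]

for label in ("MR", "SAD", "MAD"):
    share = 100.0 * counts[label] / total
    print(f"{label:<12s}{counts[label]:>7d}  {share:5.1f}%")

print(f"{'NULL+OTHER':<12s}{unresolved:>7d}  {100.0 * unresolved / total:5.1f}%")
print(f"{'total':<12s}{total:>7d}")
```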
But yes, Tim and Thierry are right, there is definitely confusion
and disagreement about what exactly constitutes "molecular
replacement". I have been somewhat draconian here and lumped "MR"
and "direct refinement" together because frankly it is just too
hard to tell them apart from the PDB alone. In fact, there are
260 PDB entries that claim they were solved by "AB INITIO"
methods, but did they really use direct methods with no prior
phase information at all? Or did they do MAD/SAD and think that
meant "ab initio"? Probably examples of all. The resolution of
the structure might be helpful in this case, but sometimes even
reading the paper doesn't help.
I do agree though that the use of "molecular replacement" should
be true to the term as Rossmann coined it, where the 6D search of
a model against the new dataset was actually required to "solve"
the structure. Lots of people run molrep (or molprep!?) for each
and every dataset, even when they are doing ligand soaks. I have
never really understood why.
However, I'm sure the day is not far off when phenix.refine or the
like will check if the starting R factor is too high and just
"automatically" invoke a run of MR to see if something clicks.
Maybe even trying alternative space groups, just in case you
screwed that up too. There may also soon be "automatic" runs of
BALBES based on the sequence information of the input file.
Eventually, the difference between different "methods" gets
muddied. Then your "average" depositor will be even more unsure
about what to put into REMARK 200, and people like me will be
endlessly asking for command-line options that turn these
"features" off.
-James Holton
MAD Scientist
On 4/15/2013 6:48 AM, Raji Edayathumangalam wrote: