> Gerard
> I said in INCREASING order of influence/power i.e. you are in first place.
Ooohhhh! *Now* it makes sense! :-)
--Gerard
> The joke comes from:
> "I used to think if there was reincarnation, I wanted to come back as the
> President or the Pope or a .400 baseball hitter. But now I want to come back
> as the bond market. You can intimidate everyone."
> --James Carville, Clinton campaign strategist
>
> Thanks for the comprehensive reply
> Regards
> Colin
>
> -----Original Message-----
> From: CCP4 bulletin board On Behalf Of Gerard DVD Kleywegt
> Sent: 28 October 2011 22:03
> To: ccp4bb
> Subject: [ccp4bb] To archive or not to archive, that's the question!
>
> Hi all,
>
> It appears that during my time here at Cold Spring Harbor, I have missed a
> small debate on CCP4BB (in which my name has been used in vain to boot).
>
> I have not yet had time to read all the contributions, but would like to make
> a few points that I hope will contribute to the discussion and keep it firmly
> grounded (as opposed to La La Land, home of the people who think that
> image archiving can be done on a shoestring budget... more about this in a
> bit).
>
> Note: all of this is written in a personal capacity, i.e. it is not official
> wwPDB gospel. Oh, and sorry for the new subject line, but this way I can track
> the replies more easily.
>
> It seems to me that there are a number of issues that need to be separated:
>
> (1) the case for/against storing raw data
> (2) implementation and resources
> (3) funding
> (4) location
>
> I will say a few things about each of these issues in turn:
>
> -----------
>
> (1) Arguments in favour and against the concept of storing raw image data, as
> well as possible alternative solutions that could address some of the issues
> at lower cost or complexity.
>
> I realise that my views carry a weight=1.0 just like everybody else's, and
> many of the arguments and counter-arguments have already been made, so I will
> not add to these at this stage.
>
> -----------
>
> (2) Implementation details and required resources.
>
> If the community should decide that archiving raw data would be scientifically
> useful, then it has to decide how best to do it. This will determine the level
> of resources required to do it. Questions include:
>
> - what should be archived? (See Jim H's list from (a) to (z) or so.) An
> initial plan would perhaps aim for the images associated with the data used in
> the final refinement of deposited structures.
>
> - how much data are we talking about per dataset/structure/year?
>
> - should it be stored close to the source (i.e., responsibility and costs for
> depositors or synchrotrons) or centrally (i.e., costs for some central
> resource)? If it is going to be stored centrally, the cost will be
> substantial. For example, at the EBI (the European Bioinformatics Institute)
> we have 15 PB of storage. We pay about 1500 GBP (~2300 USD) per TB of storage
> (not the kind you buy at Dixons or Radio Shack, obviously). For stored data,
> we have a data-duplication factor of ~8, i.e. every file is stored 8 times (at
> three data centres, plus back-ups, plus a data-duplication centre, plus
> unreleased versus public versions of the archive). (Note - this is only for
> the EBI/PDBe! RCSB and PDBj will have to acquire storage as well.) Moreover,
> disks have to be housed in a building (not free!), with cooling, security
> measures, security staff, maintenance staff, electricity (substantial cost!),
> rental of a 1-10 Gb/s connection, etc. All hardware has a life-cycle of three
> years (barring failures) and then needs to be replaced (at lower cost, but
> still not free).
>
> - if the data is going to be stored centrally, how will it get there? Using
> ftp will probably not be feasible.
>
> - if it is not stored centrally, how will long-term data availability be
> enforced? (Otherwise I could have my data on a public server until my paper
> comes out in print, and then remove it.)
>
> - what level of annotation will be required? There is no point in having
> zillions of files lying around if you don't know which
> structure/crystal/sample they belong to, at what wavelength they were
> recorded, if they were used in refinement or not, etc.
>
> - an issue that has not been raised yet, I think: who is going to validate
> that the images actually correspond to the structure factor amplitudes or
> intensities that were used in the refinement? This means that the data will
> have to be indexed, integrated, scaled, merged, etc. and finally compared to
> the deposited Fobs or Iobs. This will have to be done for *10,000 data sets a
> year*... And I can already imagine the arguments that will follow between
> depositors and "re-processors" about what software to use, what resolution
> cut-off, what outlier-rejection criteria, etc. How will conflicts and
> discrepancies be resolved? This could well end up taking a day of working time
> per data set, i.e. with 200 working days per year, one would need 50 *new*
> staff for this task alone. For comparison: worldwide, there is currently a
> *total* of ~25 annotators working for the wwPDB partners...
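
The figures quoted in the list above can be turned into a back-of-envelope estimate. The Python sketch below uses only the numbers given there (1500 GBP per TB, a duplication factor of 8, a three-year hardware life-cycle, 10,000 data sets a year, one working day per data set, 200 working days per year). The function names are hypothetical, and two simplifications are assumptions: the per-TB price is applied to every stored copy, and replacement hardware is costed at the same price (the text says it would be cheaper), so the yearly figure is an upper bound. Building, power, staff and network costs are left out.

```python
import math


def storage_cost_gbp(raw_tb, gbp_per_tb=1500.0, duplication=8):
    """Hardware cost in GBP for raw_tb terabytes of raw data.

    Assumption: the per-TB price is paid for every stored copy.
    """
    return raw_tb * duplication * gbp_per_tb


def annual_hardware_cost_gbp(raw_tb, lifecycle_years=3, **kwargs):
    """Spread the hardware cost over its life-cycle.

    Assumption: replacement costs the same as the first purchase,
    so this is an upper bound.
    """
    return storage_cost_gbp(raw_tb, **kwargs) / lifecycle_years


def staff_needed(datasets_per_year=10_000, days_per_dataset=1.0,
                 working_days_per_year=200):
    """Full-time staff required to reprocess every deposited data set."""
    return math.ceil(datasets_per_year * days_per_dataset
                     / working_days_per_year)


if __name__ == "__main__":
    print(storage_cost_gbp(1.0))          # 12000.0 GBP per raw TB
    print(annual_hardware_cost_gbp(1.0))  # 4000.0 GBP per raw TB per year
    print(staff_needed())                 # 50
```

The per-TB framing is deliberate: the total raw volume per year is one of the open questions in the list, so it is kept as an input rather than guessed.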
>
> Not many of you know that (about 10 years ago) I spent probably an entire year
> of my life sorting out the mess that was the PDB structure factor files
> pre-EDS... We were apparently the first people to ever look at the tens of
> thousands of structure factor files and try to use all of them to calculate
> maps for the EDS server. (If there were others who attempted this before us,
> they had probably run away screaming.) This went well for many files, but
> there were many, many files that had problems. There were dozens of different
> kinds of issues: non-CIF files, CIF files with wrong headers, Is instead of
> Fs, Fcalc instead of Fobs, all "h" equal to 0, non-space-separated columns,
> etc. For a list, see: http://eds.bmc.uu.se/eds/eds_help.html#PROBLEMS
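
A few of the problems named above lend themselves to automatic checks. The sketch below is only an illustration and not the procedure that was used for EDS: it assumes a hypothetical plain-text layout of one reflection per line with four whitespace-separated fields (h, k, l, value), and the function names are made up. The "negative value" test is a heuristic, on the grounds that amplitudes cannot be negative whereas measured intensities can.

```python
def parse_reflection_line(line, n_fields=4):
    """Parse 'h k l value'; fail loudly if columns have run together."""
    parts = line.split()
    if len(parts) != n_fields:
        raise ValueError(
            f"expected {n_fields} space-separated fields, got {len(parts)}: {line!r}"
        )
    h, k, l = (int(p) for p in parts[:3])
    return h, k, l, float(parts[3])


def find_problems(reflections):
    """Return a list of suspected problems for (h, k, l, value) tuples."""
    if not reflections:
        return ["no reflections"]
    problems = []
    if all(h == 0 for h, _, _, _ in reflections):
        problems.append("all h equal to 0")
    if any(value < 0 for _, _, _, value in reflections):
        # Heuristic: negative values suggest Is were deposited instead of Fs.
        problems.append("negative values: possibly intensities, not amplitudes")
    return problems


if __name__ == "__main__":
    good = [parse_reflection_line(s) for s in ("1 0 0 12.5", "0 2 1 7.0")]
    print(find_problems(good))                            # []
    print(find_problems([(0, 1, 2, 3.0), (0, 2, 2, 4.0)]))  # ['all h equal to 0']
```

Checks like these only flag candidates; deciding what a flagged file really contains still needs a person to look at it.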
>
> Anyway, my point is that simply having images without annotation and without
> reprocessing is like having a crystallographic kitchen sink (or bit bucket)
> which will turn out to be 50% useless when the day comes that somebody wants
> to do archive-wide analysis/reprocessing/rerefinement etc. And if the point is
> to "catch cheaters" (which in my opinion is one of the weakest, least-fundable
> arguments for storage), then the whole operation is in fact pointless without
> reprocessing by a "third party" at deposition time.
>
> -----------
>
> (3) Funding.
>
> This is one issue we can't really debate - ultimately, it is the funding
> agencies who have to be convinced that the cost/benefit ratio is low enough.
> The community will somehow have to come up with a stable, long-term funding
> model. The outcome of (2) should enable one to estimate the initial investment
> cost plus the variable cost per year. Funding could be done in different ways:
>
> - centrally - e.g., a big application for funding from NIH or EU
>
> - by charging depositors (just like they are charged Open Access charges,
> which can often be reclaimed from the funding agencies) - would you be willing
> to pay, say, 5000 USD per dataset to secure "perpetual" storage?
>
> - by charging users (i.e., Gerard Bricogne :-) - just kidding!
>
> Of course, if the consensus is to go for decentralised storage and a DOI-like
> identifier system, there will be no need for a central archive, and the
> identifiers could be captured upon deposition in the PDB. (We could also check
> once a week if the files still exist where they are supposed to be.)
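
The weekly existence check mentioned above could be as simple as the following sketch, which uses only the Python standard library. Everything specific here is an assumption: the mapping from entry ID to URL, the function names, and the choice of an HTTP HEAD request as the test for "still exists" (a server that answers does not prove the file is complete or unchanged).

```python
import urllib.request


def url_exists(url, timeout=10.0):
    """Return True if a HEAD request to url succeeds."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, ValueError):
        # OSError covers URLError, HTTPError and timeouts;
        # ValueError covers malformed URLs.
        return False


def find_missing(links, exists=url_exists):
    """Return the sorted IDs whose registered URL no longer resolves.

    links maps a (hypothetical) entry ID to the URL captured at deposition.
    """
    return sorted(entry_id for entry_id, url in links.items()
                  if not exists(url))


if __name__ == "__main__":
    # Placeholder IDs and URLs, with a stand-in checker so no network is needed.
    demo = {"entry1": "https://example.org/a", "entry2": "https://example.org/b"}
    print(find_missing(demo, exists=lambda url: url.endswith("/a")))  # ['entry2']
```

Passing the checker in as a parameter keeps the weekly job testable offline and makes it easy to swap in a stricter test, such as comparing a checksum recorded at deposition.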
>
> -----------
>
> (4) Location.
>
> If the consensus is to have decentralised storage, the solution is quite
> simple and very cheap in terms of "centralised" cost - wwPDB can capture
> DOI-like identifiers upon deposition and make them searchable.
>
> If central storage is needed, then there has to be an institution willing and
> able to take on this task. The current wwPDB partners are looking at future
> funding that is at best flat, with increasing numbers of depositions that also
> get bigger and more complex. There is *no way on earth* that wwPDB can accept
> raw data (be it X-ray, NMR or EM! this is not an exclusive X-ray issue)
> without *at least* double the current level of funding (and not just in the US
> for RCSB, but also in Japan for PDBj and in Europe for PDBe)! I am pretty
> confident that this is simply *not* going to happen.
>
> [Besides, in my own humble opinion, in order to remain relevant (and
> fundable!) in the biomedical world, the PDB will have to restyle itself as a
> biomedical resource instead of a crystallographic archive. We must take the
> structures to the biologists, and we must expand in breadth of coverage to
> include emerging hybrid methods that are relevant for structural cell (as
> opposed to molecular) biology. This mission will be much easier to fund on
> three continents than archiving TBs of raw data that have little or no
> tangible (i.e., fundable) impact on our quest to find a cure for various kinds
> of cancer (or hair loss) or to feed a growing population.]
>
> However, there may be a more realistic solution. The role model could be NMR,
> which has its own global resource for data storage in the BMRB. BMRB is a
> wwPDB partner - if you deposit an NMR model with us, we take your ensemble
> coordinates, metadata, restraints and chemical shifts - any other NMR data
> (including spectra and FIDs) can subsequently be deposited with BMRB. These
> data will get their own BMRB ID which can be linked to the PDB ID.
>
> A model like this has advantages - it could be housed in a single place, run
> by X-ray experts (just as BMRB is co-located with NMRFAM, the national NMR
> facility at Madison), and there would be only one place that would need to
> secure the funding (which would be substantially larger than the estimate of
> $1000 per year suggested by a previous poster from La La Land). This could for
> instance be a synchrotron (linked to INSTRUCT?), or perhaps one of the
> emerging nations could be enticed to take on this challenging task. I would
> expect that such a centre would be closely affiliated with the wwPDB
> organisation, or become a member just like BMRB. A similar model could also be
> employed for archiving raw EM image data.
>
> -----------
>
> I've said enough for today. It's almost time for the booze-up that kicks off
> the PDB40 symposium here at CSHL! Heck, some of you who read this might be
> here as well!
>
> Btw - Colin Nave wrote:
>
> "(in increasing order of influence/power do we have the Pope, US president,
> the Bond Market and finally Gerard K?)"
>
> I'm a tad disappointed to be only in fourth place, Colin! What has the Pope
> ever done for crystallography?
>
> --Gerard
>
> ******************************************************************
> Gerard J. Kleywegt
>
> http://xray.bmc.uu.se/gerard
> ******************************************************************
> The opinions in this message are fictional. Any similarity
> to actual opinions, living or dead, is purely coincidental.
> ******************************************************************
> Little known gastromathematical curiosity: let "z" be the
> radius and "a" the thickness of a pizza. Then the volume
> of that pizza is equal to pi*z*z*a !
> ******************************************************************
>
Best wishes,
--Gerard