You know, Kevin,
April Fools notwithstanding, your idea actually makes good sense in a
tiny-url sort of way. There would of course be collisions, and thus a
need for a global disambiguation registry, but society could do a whole
lot worse than something like:
http://prot.seq.db/3fc28e91d74b39ec/a6
(translated: protein sequence hash #3fc28e91d74b39ec, registry index
#a6)
as a way of unambiguously storing, referring to, and retrieving known
sequences.
The URL, when requested, would of course simply return the registered
sequence. Keeping the scope extremely narrow like that would be the key
to the registry's success: just "natural 20" sequences with no
annotations.
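
As a rough illustration of the hash-plus-registry-index scheme above, here
is a minimal sketch in Python. Everything named in it is an assumption made
for illustration: the SequenceRegistry class, the hash64 helper, and the
in-memory storage are hypothetical, and a truncated SHA-256 stands in for
CRC64 because Python's standard library has no CRC64 routine.

```python
import hashlib

NATURAL_20 = frozenset("ACDEFGHIKLMNPQRSTVWY")  # one-letter codes, no annotations


def hash64(seq):
    """Assumed stand-in hash: first 64 bits of SHA-256, as 16 hex digits."""
    return hashlib.sha256(seq.encode("ascii")).hexdigest()[:16]


class SequenceRegistry:
    """Toy in-memory registry mapping 'hash/index' paths to sequences."""

    def __init__(self, hash_fn=hash64):
        self._hash_fn = hash_fn
        self._buckets = {}  # hash -> list of distinct sequences sharing it

    def register(self, seq):
        """Store a sequence and return its 'hash/index' path."""
        seq = seq.strip().upper()
        if not seq or not set(seq) <= NATURAL_20:
            raise ValueError("only non-empty natural-20 sequences are accepted")
        key = self._hash_fn(seq)
        bucket = self._buckets.setdefault(key, [])
        if seq not in bucket:
            bucket.append(seq)  # a colliding sequence gets the next index
        return f"{key}/{bucket.index(seq):x}"

    def lookup(self, path):
        """Return the registered sequence for a 'hash/index' path."""
        key, index = path.split("/")
        return self._buckets[key][int(index, 16)]
```

Registering the same sequence twice returns the same path, and two
different sequences that happen to share a hash differ only in the
trailing index, which is the disambiguation role the registry plays.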
Optimal details might differ of course (CRC64 is suboptimal for ASCII
sequences), but as a general concept, I do think you're on to something
powerful here...
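
For a feel of how often the registry index would actually be needed, here
is a small birthday-bound estimate in Python. It assumes hash values are
uniformly distributed over 64 bits, which CRC64 over ASCII sequences may
not satisfy, so treat it as an optimistic sketch; the function name
collision_probability is made up for this example.

```python
import math


def collision_probability(n, bits=64):
    """Approximate chance that at least two of n items share a hash.

    Uses the standard birthday approximation 1 - exp(-n(n-1) / 2^(bits+1)),
    assuming uniformly distributed hash values.
    """
    exponent = n * (n - 1) / 2 ** (bits + 1)
    return -math.expm1(-exponent)  # accurate even when the result is tiny


# About 4 billion (2**32) sequences under a 64-bit hash:
print(round(collision_probability(2 ** 32), 3))  # prints 0.393
```

Under that assumption, roughly 4 billion registered sequences give about
a 39% chance of at least one clash, while a million sequences give a
chance on the order of 1 in 40 million.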
Cheers,
Warren
> -----Original Message-----
> From: CCP4 bulletin board [mailto:[log in to unmask]] On Behalf Of
> Kevin Cowtan
> Sent: Wednesday, April 01, 2009 5:02 AM
> To: [log in to unmask]
> Subject: Re: [ccp4bb] New human genome policy - please read.
>
> Why molecular weight? That's just arbitrary.
>
> There is a simple way of referring to proteins which avoids any
> ambiguity - by its sequence. When referring to a protein, we should use
> its sequence as an identifier. No ambiguity.
>
> Now, I understand that some smart people in America are now solving
> proteins of more than a dozen aa in length. For these, quoting the whole
> sequence could be a bit long. Fortunately this is a solved problem: all
> we need to do is quote a CRC64 hash of the ascii representation of the
> protein sequence. This gives a name space big enough that we can name
> about 4 billion proteins before the probability of a name clash becomes
> significant.
>
>
> James Stroud wrote:
> > I think actually *naming* the proteins would be too extreme. Even the
> > current alpha-numeric system is overwrought. I liked it better when we
> > just called proteins "p75" or "p105". For instance, how many proteins in
> > the human genome are 75 kD, anyway? My guess is not enough to make the
> > situation ambiguous in any catastrophic way.
>
>
>