Print

Print


Hi Ethan,

> Is that "A0000" a hexidecimal number, or is it a decimal number
> that just happens to have an "A" in front of it?
> [A-Z][0-9999] gives a larger range of values than 5 bytes of hexadecimal,
> so I'm guessing it's the former.  But the example is not clear.
>
> (yes I could download and inspect the source, but I'm lazy tonight)

If you simply click on the

  Overview and Python reference implementation: (as text)

link you'll get the full story immediately in your web browser.
For your convenience, it is also attached below.

Cheers,
        Ralf


               Prototype/reference implementation for
                      encoding and decoding
                     atom serial numbers and
                     residue sequence numbers
                          in PDB files.

PDB ATOM and HETATM records reserve columns 7-11 for the atom serial
number. This 5-column number is used as a reference in the CONECT
records, which also reserve exactly five columns for each serial
number.

With the decimal counting system only up to 99999 atoms can be stored
and uniquely referenced in a PDB file. A simple extension to enable
processing of more atoms is to adopt a counting system with more than
ten digits. To maximize backward compatibility, the counting system is
only applied for numbers greater than 99999. The "hybrid-36" counting
system implemented in this file is:

  ATOM      1
  ...
  ATOM  99999
  ATOM  A0000
  ATOM  A0001
  ...
  ATOM  A0009
  ATOM  A000A
  ...
  ATOM  A000Z
  ATOM  ZZZZZ
  ATOM  a0000
  ...
  ATOM  zzzzz

I.e. the first 99999 serial numbers are represented as usual. The
following atoms use a base-36 system (10 digits + 26 letters) with
upper-case letters. 43670016 (26*36**4) additional atoms can be
numbered this way. If there are more than 43770015 (99999+43670016)
atoms, a base-36 system with lower-case letters is used, allowing for
43670016 additional atoms. I.e. in total 87440031 (99999+2*43670016)
atoms can be stored and uniquely referenced via CONECT records.

The counting system is designed to avoid lower-case letters until the
range of numbers addressable by upper-case letters is exhausted.
Importantly, with this counting system the distinction between
"traditional" and "extended" PDB files becomes evident only if there
are more than 99999 atoms to be stored. Programs that are
updated to support the hybrid-36 counting system will continue to
interoperate with programs that do not as long as there are less than
100000 atoms.

PDB ATOM and HETATM records also reserve columns 23-26 for the residue
sequence number. This 4-column number is used as a reference in other
record types (SSBOND, LINK, HYDBND, SLTBRG, CISPEP), which also reserve
exactly four columns for each sequence number.

With the decimal counting system only up to 9999 residues per chain can
be stored and uniquely referenced in a PDB file. If the hybrid-36
system is adopted, 1213056 (26*36**3) additional residues can be
numbered using upper-case letters, and the same number again using
lower-case letters. I.e. in total each chain may contain up to 2436111
(9999+2*1213056) residues that can be uniquely referenced from the
other record types given above.

The implementation in this file should run with Python 2.2 or higher.
There are no other requirements. Run this script without arguments to
obtain usage examples.

Note that there are only about 60 lines of "real" code. The rest is
documentation and unit tests.

To update an existing program to support the hybrid-36 counting system,
simply replace the existing read/write source code for integer values
with equivalents of the hy36decode() and hy36encode() functions below.

This file is unrestricted Open Source (cctbx.sf.net).
Please send corrections and enhancements to [log in to unmask] .

See also:
  http://cci.lbl.gov/hybrid_36/
  http://www.pdb.org/ "Dictionary & File Formats"

Ralf W. Grosse-Kunstleve, Feb 2007.



Park yourself in front of a world of choices in alternative vehicles.
Visit the Yahoo! Auto Green Center.