JISCMail - CCP4BB Archives

Hi Marcin,

On Wed, 2013-09-18 at 18:33 +0100, Marcin Wojdyr wrote:
> On Wed, Sep 18, 2013 at 03:41:34PM +0100, Peter Keller wrote:
> 
> > I hope that this isn't quite what you meant....  There are already
> > mutually-incompatible CIF dialects out there that have been created
> > by developers coding to their own understanding and interpretations
> > of the CIF/STAR format. I am sure that you would not want to be the
> > creator of yet another one :-) Correct tokenising is a necessary
> > (but not sufficient) condition for preventing the problem getting
> > worse.
> 
> This reminded me that I was looking into CIF grammar several years
> ago. I took "Appendix A: A formal grammar for CIF":
> http://www.iucr.org/resources/cif/spec/version1.1/cifsyntax#bnf
> and I used it (after necessary syntax modification) in Boost.Spirit,
> which is one of many parser generators.
> 
> Then I noted two things that may be errors in the specification:
> 
> - no whitespace between LoopHeader and LoopBody
> 
>  see <DataItems>: <LoopHeader> ends with <Tag>, <LoopBody> starts with
>  <Value>, but there is no <WhiteSpace> between.
> 
> - extra "|" in <TokenizedComments> (...<eol> |}...)
> 
> Am I right?

This grammar seems to be based on the 1994 J. Chem. Inf. Comput. Sci.
one, which has some serious errors. I would strongly discourage anyone
from trying to translate it into input for any kind of parser generator. I
suggest that you use International Tables vol. G instead (chapter 2.1 or
section 2.2.7). It is unfortunate that the later, correct, grammar is
not available for free: I wonder if the IUCr and Springer might be
persuaded to allow open access to some sections of this volume, or to
allow redistribution through some other channel.

As Herb pointed out, the full grammar is context-dependent. I haven't
done any "real" mmCIF development for some time, but both my past
experience and more recent discussions with developers who have to
handle the format lead me to the opinion that it is a mistake to try to
implement the full formal specification in a single parser. It is better
to use a two-layer approach, i.e. to tokenise first (and all the
context-dependence of the grammar is then kept in that layer), and
implement a higher-level parser on top of that, which handles a sequence
of tokens in which the token type and value have already been worked
out. There is nothing original in this approach (it is the one outlined
in chapters 3 and 4 of the classic Red and Purple Dragon Books: see
<https://en.wikipedia.org/wiki/Compilers:_Principles,_Techniques,_and_Tools>)
and I have found that it works well for the STAR format.
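
As a rough illustration of the two-layer idea, here is a minimal Python
sketch. It is not taken from any existing CIF library: the names Token,
tokenize and parse are made up for this example, and it assumes a
simplified CIF 1.1 subset (no save frames, no global_ or stop_, no
line-length or character-set checks). The context-dependent rules
(a semicolon only opens a text field at the start of a line, a quote
only closes a string when followed by whitespace, a "#" only starts a
comment at the start of a token) all live in the tokeniser, so the
parser sees nothing but token kinds and values.

```python
from typing import Dict, Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    kind: str  # "DATA", "LOOP", "TAG" or "VALUE"
    text: str


def _tokenize_line(line: str) -> Iterator[Token]:
    """Tokenise one line that is not inside a semicolon text field."""
    pos, n = 0, len(line)
    while pos < n:
        ch = line[pos]
        if ch in " \t":
            pos += 1
        elif ch == "#":
            return  # comment runs to the end of the line
        elif ch in "'\"":
            # A quote closes the string only if followed by whitespace or EOL.
            end = pos + 1
            while end < n and not (
                line[end] == ch and (end + 1 == n or line[end + 1] in " \t")
            ):
                end += 1
            if end >= n:
                raise ValueError("unterminated quoted string: " + line)
            yield Token("VALUE", line[pos + 1:end])
            pos = end + 1
        else:
            end = pos
            while end < n and line[end] not in " \t":
                end += 1
            word, pos = line[pos:end], end
            if word.lower().startswith("data_"):
                yield Token("DATA", word[5:])
            elif word.lower() == "loop_":
                yield Token("LOOP", word)
            elif word.startswith("_"):
                yield Token("TAG", word)
            else:
                yield Token("VALUE", word)


def tokenize(source: str) -> Iterator[Token]:
    """Layer 1: turn raw text into tokens with kind and value decided."""
    lines = source.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith(";"):
            # Text field: runs until the next line that starts with ';'.
            body = [line[1:]]
            i += 1
            while i < len(lines) and not lines[i].startswith(";"):
                body.append(lines[i])
                i += 1
            if i >= len(lines):
                raise ValueError("unterminated semicolon text field")
            yield Token("VALUE", "\n".join(body))
            line = lines[i][1:]  # anything after the closing ';'
        yield from _tokenize_line(line)
        i += 1


def parse(tokens: Iterable[Token]) -> Dict[str, Dict[str, List[str]]]:
    """Layer 2: build {block: {tag: [values]}} from the token stream."""
    toks = list(tokens)
    blocks: Dict[str, Dict[str, List[str]]] = {}
    current = None
    i = 0
    while i < len(toks):
        tok = toks[i]
        if tok.kind == "DATA":
            current = blocks.setdefault(tok.text, {})
            i += 1
        elif current is None:
            raise ValueError("content before the first data block")
        elif tok.kind == "LOOP":
            i += 1
            tags, values = [], []
            while i < len(toks) and toks[i].kind == "TAG":
                tags.append(toks[i].text)
                i += 1
            while i < len(toks) and toks[i].kind == "VALUE":
                values.append(toks[i].text)
                i += 1
            if not tags or not values or len(values) % len(tags):
                raise ValueError("malformed loop")
            for k, tag in enumerate(tags):
                current[tag] = values[k::len(tags)]
        elif tok.kind == "TAG":
            if i + 1 >= len(toks) or toks[i + 1].kind != "VALUE":
                raise ValueError("tag without a value: " + tok.text)
            current[tok.text] = [toks[i + 1].text]
            i += 2
        else:
            raise ValueError("unexpected value: " + tok.text)
    return blocks
```

Note that a quoted 'loop_' comes out of the tokeniser as a VALUE while
a bare loop_ comes out as LOOP, so the parser never has to look at
quoting or line positions again.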

Regards,
Peter.

-- 
Peter Keller                                     Tel.: +44 (0)1223 353033
Global Phasing Ltd.,                             Fax.: +44 (0)1223 366889
Sheraton House,
Castle Park,
Cambridge CB3 0AX
United Kingdom