Hi Marcin,

On Wed, 2013-09-18 at 18:33 +0100, Marcin Wojdyr wrote:
> On Wed, Sep 18, 2013 at 03:41:34PM +0100, Peter Keller wrote:
>
> > I hope that this isn't quite what you meant.... There are already
> > mutually-incompatible CIF dialects out there that have been created
> > by developers coding to their own understanding and interpretations
> > of the CIF/STAR format. I am sure that you would not want to be the
> > creator of yet another one :-) Correct tokenising is a necessary
> > (but not sufficient) condition for preventing the problem getting
> > worse.
>
> This reminded me that I was looking into CIF grammar several years
> ago. I took "Appendix A: A formal grammar for CIF":
> http://www.iucr.org/resources/cif/spec/version1.1/cifsyntax#bnf
> and I used it (after necessary syntax modification) in Boost.Spirit,
> which is one of many parser generators.
>
> Then I noted two things that may be errors in the specification:
>
> - no whitespace between LoopHeader and LoopBody
>
> see <DataItems>: <LoopHeader> ends with <Tag>, <LoopBody> starts with
> <Value>, but there is no <WhiteSpace> between.
>
> - extra "|" in <TokenizedComments> (...<eol> |}...)
>
> Am I right?

This grammar seems to be based on the 1994 J. Chem. Inf. Comp. Sci one,
which has some serious errors. I would strongly discourage anyone from
trying to translate it into input for any kind of parser generator.

I suggest that you use International Tables vol. G instead (chapter 2.1
or section 2.2.7). It is unfortunate that the later, correct, grammar is
not available for free: I wonder if the IUCr and Springer might be
persuaded to allow open access to some sections of this volume, or to
allow redistribution through some other channel.

As Herb pointed out, the full grammar is context-dependent.
I haven't done any "real" mmCIF development for some time, but both my
past experience and more recent discussions with developers who have to
handle the format lead me to the opinion that it is a mistake to try to
implement the full formal specification in a single parser. It is better
to use a two-layer approach, i.e. to tokenise first (and all the
context-dependence of the grammar is then kept in that layer), and
implement a higher-level parser on top of that, which handles a sequence
of tokens in which the token type and value have already been worked
out.

There is nothing original in this approach (it is the one outlined in
chapters 3 and 4 of the classic Red and Purple Dragon Books: see
<https://en.wikipedia.org/wiki/Compilers:_Principles,_Techniques,_and_Tools>)
and I have found that it works well for the STAR format.

Regards,
Peter.

-- 
Peter Keller                         Tel.: +44 (0)1223 353033
Global Phasing Ltd.,                 Fax.: +44 (0)1223 366889
Sheraton House,
Castle Park,
Cambridge CB3 0AX
United Kingdom
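
To make the two-layer approach described above concrete, here is a
minimal sketch in Python. It is an illustration only, not a conforming
CIF/STAR parser: the names Token, tokenise, parse and SAMPLE are invented
for this example, and it assumes a much simplified subset of the format
(no save frames, no global_ or stop_, no line-length or character-set
checks, and anything after a closing ';' on the same line is ignored).
The point is only that the context-dependent rules (a ';' is special only
in column 1, a quote closes a string only when followed by whitespace,
a '#' starts a comment only at the start of a token) all live in the
tokeniser, so that the parser sees nothing but typed tokens.

```python
from typing import Dict, Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    kind: str  # one of "DATA", "LOOP", "TAG", "VALUE"
    text: str


def tokenise(source: str) -> Iterator[Token]:
    """Layer 1: every context-dependent lexical rule is handled here."""
    lines = source.splitlines()
    row = 0
    while row < len(lines):
        line = lines[row]
        if line.startswith(";"):
            # ';' in column 1 opens a multi-line text field, which runs
            # until the next line that starts with ';'.
            body = [line[1:]]
            row += 1
            while row < len(lines) and not lines[row].startswith(";"):
                body.append(lines[row])
                row += 1
            yield Token("VALUE", "\n".join(body))
            row += 1  # skip the closing ';' line (simplification)
            continue
        col = 0
        while col < len(line):
            ch = line[col]
            if ch.isspace():
                col += 1
            elif ch == "#":
                break  # comment: ignore the rest of the line
            elif ch in "'\"":
                # The matching quote only closes the string when it is
                # followed by whitespace or the end of the line.
                end = col + 1
                while end < len(line) and not (
                    line[end] == ch
                    and (end + 1 == len(line) or line[end + 1].isspace())
                ):
                    end += 1
                yield Token("VALUE", line[col + 1:end])
                col = end + 1
            else:
                end = col
                while end < len(line) and not line[end].isspace():
                    end += 1
                word = line[col:end]
                if word.lower().startswith("data_"):
                    yield Token("DATA", word[5:])
                elif word.lower() == "loop_":
                    yield Token("LOOP", word)
                elif word.startswith("_"):
                    yield Token("TAG", word)
                else:
                    yield Token("VALUE", word)
                col = end
        row += 1


def parse(tokens: Iterable[Token]) -> Dict[str, dict]:
    """Layer 2: a context-free pass over tokens whose type is already known."""
    toks: List[Token] = list(tokens)
    blocks: Dict[str, dict] = {}
    current = None
    i = 0
    while i < len(toks):
        tok = toks[i]
        if tok.kind == "DATA":
            current = blocks.setdefault(tok.text, {})
            i += 1
        elif current is None:
            raise ValueError("content found before any data_ block")
        elif tok.kind == "LOOP":
            i += 1
            tags, values = [], []
            while i < len(toks) and toks[i].kind == "TAG":
                tags.append(toks[i].text)
                i += 1
            while i < len(toks) and toks[i].kind == "VALUE":
                values.append(toks[i].text)
                i += 1
            if not tags or len(values) % len(tags) != 0:
                raise ValueError("loop values do not fill whole rows")
            for n, tag in enumerate(tags):
                current[tag] = values[n::len(tags)]  # one column per tag
        elif tok.kind == "TAG":
            if i + 1 >= len(toks) or toks[i + 1].kind != "VALUE":
                raise ValueError(f"tag {tok.text} has no value")
            current[tok.text] = toks[i + 1].text
            i += 2
        else:
            raise ValueError(f"unexpected value {tok.text!r}")
    return blocks


SAMPLE = """\
data_demo
_cell.length_a   10.5
_title 'O'Brien's data'   # an embedded quote does not end the string
loop_
_atom.id
_atom.symbol
1 C
2 'N 1'
_note
;first line
second line
;
"""

if __name__ == "__main__":
    print(parse(tokenise(SAMPLE)))
```

Note that parse() never looks at individual characters: if a dialect
disagrees about quoting or text fields, only tokenise() has to change.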