Hi Marcin,
On Thu, 2013-09-19 at 14:51 +0100, Marcin Wojdyr wrote:
> Hi Peter,
>
> On Thu, Sep 19, 2013 at 10:28:22AM +0100, Peter Keller wrote:
>
> > > http://www.iucr.org/resources/cif/spec/version1.1/cifsyntax#bnf
> ...
> >
> > This grammar seems to be based on the 1994 J. Chem. Inf. Comp. Sci one,
> > which has some serious errors. I would strongly discourage anyone from
> > trying to translate it into input for any kind of parser generator. I
> > suggest that you use International Tables vol. G instead (chapter 2.1 or
> > section 2.2.7). It is unfortunate that the later, correct, grammar is
>
> I don't have these tables,
Are you sure? I would be surprised if you didn't have them available
through your library, either as hard copies, or through an on-line
subscription at the DOI links I gave in my article. International Tables
are pretty fundamental to CCP4's domain of MX, as well as several
others, after all. Perhaps you could have a word with the library staff?
> but could you be more specific what's incorrect
> in the version from the IUCR website?
This is ancient (mid-to-late 1990s) history for me: I would need to
track down some old e-mail correspondence and hunt through it, and I
don't have the time at the moment. I do remember a problem with the way
that quoted strings were defined, but that, and the other errors that I
spotted then, may have been fixed since. However, giving it a quick
look, I can see for example the following problem:
<LoopBody> : <Value> { <WhiteSpace> <Value> }*
For this to work, the '*' must be a "greedy" quantifier, i.e. match
every { <WhiteSpace> <Value> } until it hits something that is not
{ <WhiteSpace> <Value> }. In this production though:
<SingleQuotedString> <WhiteSpace>: <single_quote> {<AnyPrintChar>}* <single_quote> <WhiteSpace>
the '*' has to be a "lazy" quantifier, i.e. match <AnyPrintChar> only as
far as the next <single_quote> <WhiteSpace> . Bear in mind that
<AnyPrintChar> includes both <single_quote> and two of the characters
that are also included in <WhiteSpace>.
Differences like this can be expressed in human-written code as long as
the coder is aware of them. A grammar that is intended to be used to
match data or generate a parser requires a more rigorous definition. Any
parser/lexer that uses a greedy quantifier for * would match a line of
data like this:
val1 'val "2"' "val '3'" 'val '4'' val5
as just three tokens:
val1
'val "2"' "val '3'" 'val '4''
val5
rather than as five tokens. OTOH, using a lazy quantifier for * would
only match the first data value in a loop, and then throw a syntax error
for every loop body (except the trivial case which genuinely has only
one data name in the header and one data value in the body).
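To make the point concrete, here is a rough sketch using Python's re
module. It is not the CIF grammar: the patterns are my own simplified
stand-ins for the two productions (a value in the loop body is reduced
to a run of non-blank characters, and the function names tokenize and
loop_body are made up for the illustration). The only thing that
changes between runs is whether '*' is greedy or lazy.

```python
import re

# The example line of data from the text above.
LINE = """val1 'val "2"' "val '3'" 'val '4'' val5"""


def tokenize(line, quantifier):
    """Split a line into values; quantifier is "*" (greedy) or "*?" (lazy)."""
    pattern = re.compile(
        rf"""
          '.{quantifier}'(?=\s|$)   # single-quoted value, closing quote before blank/end
        | ".{quantifier}"(?=\s|$)   # double-quoted value, closing quote before blank/end
        | \S+                       # unquoted value (simplified)
        """,
        re.VERBOSE,
    )
    return pattern.findall(line)


def loop_body(text, quantifier):
    """Match value {whitespace value}* with each value simplified to non-blanks."""
    return re.match(rf"\S+(?:\s+\S+){quantifier}", text).group()


print(tokenize(LINE, "*"))              # greedy: quoted strings run together
print(tokenize(LINE, "*?"))             # lazy: each quoted string ends where it should
print(loop_body("1.0 2.0 3.0", "*"))    # greedy: the whole loop body
print(loop_body("1.0 2.0 3.0", "*?"))   # lazy: stops after the first value
```

With the greedy quantifier the quoted strings collapse into one token
but the loop body is matched in full; with the lazy one the quoted
strings come out right but the loop body stops after its first value.
Neither reading of '*' satisfies both productions at once.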
>
> I just googled cif lexers and the two ones I looked into also refer
> to the same URL that I used:
> cctbx: http://cci.lbl.gov/cctbx_sources/ucif/cif.g
> JMol: http://caagt.ugent.be/CaGe/jmol/org/jmol/adapter/smarter/CifReader.RidiculousFileFormatTokenizer.html
>
> If there are discrepancies between IUCR website and IT vol.G and it would
> be worth to list them.
It is not a matter of discrepancies: they are rather different, and if
you are active in this area, you really need to see the IT ones as well.
Regards,
Peter.
--
Peter Keller Tel.: +44 (0)1223 353033
Global Phasing Ltd., Fax.: +44 (0)1223 366889
Sheraton House,
Castle Park,
Cambridge CB3 0AX
United Kingdom