On 12/Apr/2013, at 11:26 , Neel Smith wrote:
> CTS, the retrieval protocol, should be distinguished from CTS URNs, the citation format.
I agree absolutely. I am only talking about the first artefact.
> On Apr 11, 2013, at 6:42 PM, Scot Mcphee wrote:
>
>>
>> But why does the catalogue not have the actual CTS URN as a field within its data structure?
>
> Probably because if you don't like the structure of the CTS GetCapabilities reply, it provides all the information you need to engineer around it, and validates against a schema, so it shouldn't require much effort to parse
>
> <textgroup n="ns:groupid">
> <work n="ns:workid">
> <edition n="ns:editionid">
> ....
> </edition>
> </work>
> </textgroup>
>
> and generate urn:cts:ns:groupid.workid.editionid
Yes, I already worked that out (it's the subject of my blog post). The algorithm is not explained in any of the CTS documentation that I saw. It may be a secondary result of the description of the format, but that is no substitute for a straightforward explanation of how to use the data.
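For reference, a minimal sketch of that derivation in Python, assuming the element names and the n="ns:id" attribute form shown in the fragment above. The `inventory` wrapper element, the function name and the sample data are made up for illustration, and a real GetCapabilities reply may use XML namespaces or different attribute names, so treat this as an outline rather than working client code.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, modelled on the fragment quoted above.
SAMPLE = """
<inventory>
  <textgroup n="ns:groupid">
    <work n="ns:workid">
      <edition n="ns:editionid"/>
      <translation n="ns:translationid"/>
    </work>
  </textgroup>
</inventory>
"""


def version_urns(xml_text):
    """Build one CTS URN per edition/translation element."""
    root = ET.fromstring(xml_text)
    urns = []
    for group in root.iter("textgroup"):
        # "ns:groupid" -> namespace "ns", identifier "groupid"
        namespace, _, group_id = group.get("n").partition(":")
        for work in group.iter("work"):
            _, _, work_id = work.get("n").partition(":")
            for version in work:
                if version.tag in ("edition", "translation"):
                    _, _, version_id = version.get("n").partition(":")
                    urns.append(
                        f"urn:cts:{namespace}:{group_id}.{work_id}.{version_id}"
                    )
    return urns


for urn in version_urns(SAMPLE):
    print(urn)
```

Note that this version parses the whole document into a tree before doing anything, which is the cost discussed next.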
The additional issue is the definition of "much effort". I agree that, presented in a small fragment, the algorithm is simple, and probably easily expressed in XPath. The problem is when it's combined with a 2MB XML file.
XPath is normally evaluated against a DOM, and an XML DOM requires a fully parsed document tree. That tree occupies memory (DOM is a well-known memory hog) and takes CPU to generate. SAX is the alternative, but its stream-based processing makes the sort of processing needed more difficult to achieve without a document tree fully laid out in memory. A 2MB XML file does not produce a trivial document tree. A device like a smartphone may well choke on such a tree. In fact my browser, running on a quad-core MacBook Pro with 8GB of memory, frequently chokes on that file, even when I load it off my disk (so it's not the network causing the issue, it's the size of the data set).
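To make the SAX trade-off concrete, here is a rough sketch of the stream-based route using Python's standard `xml.sax` module. It assumes the same element names and n="ns:id" attribute form as the quoted fragment; the `UrnHandler` class, the `stream_urns` function and the sample bytes are hypothetical. The handler has to carry the current textgroup and work identifiers by hand, which is exactly the bookkeeping a document tree would otherwise provide. It says nothing about how a given phone or browser would cope with the real file.

```python
import io
import xml.sax


class UrnHandler(xml.sax.ContentHandler):
    """Collect CTS URNs while streaming, without building a tree."""

    def __init__(self):
        super().__init__()
        self.namespace = ""
        self.group_id = ""
        self.work_id = ""
        self.urns = []

    def startElement(self, name, attrs):
        if name not in ("textgroup", "work", "edition", "translation"):
            return
        # "ns:groupid" -> prefix "ns", identifier "groupid"
        prefix, _, ident = attrs.get("n", "").partition(":")
        if name == "textgroup":
            self.namespace, self.group_id = prefix, ident
        elif name == "work":
            self.work_id = ident
        else:
            # edition or translation: emit the full URN
            self.urns.append(
                f"urn:cts:{self.namespace}:{self.group_id}.{self.work_id}.{ident}"
            )


def stream_urns(source):
    """Parse a file name or binary file object and return the URNs found."""
    handler = UrnHandler()
    xml.sax.parse(source, handler)
    return handler.urns


sample = io.BytesIO(
    b'<inventory><textgroup n="ns:groupid"><work n="ns:workid">'
    b'<edition n="ns:editionid"/></work></textgroup></inventory>'
)
print(stream_urns(sample))
```

Anything beyond this one derivation (for example, attaching titles or labels that appear as child elements) needs more hand-written state in the handler, which is where the extra difficulty comes in.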
>> My argument is that the "textgroup", "work", "edition" and "translation" elements should all have a child "urn" element, not that there should be a search feature.*
>>
>> At that point it becomes possible for a third party to decently implement a search feature or any other document interaction protocol. And without a search feature, no-one will be using the URNs to identify anything because they won't know about them.
>>
>> *By "search" here I don't mean "full text search" I just mean, author/title lookup to URN discovery.
>
> Sure --so one approach would be to batch process a GetCapabilities reply to implement a search app, by mapping group names, work titles, etc, to appropriate level urns.
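A batch-built lookup of this kind might be sketched as below. This is only an illustration: the `build_index` and `lookup` names, the labels and the placeholder URNs are all invented, and how the (label, URN) pairs are extracted from a GetCapabilities reply is left out entirely.

```python
from collections import defaultdict


def build_index(records):
    """Map each lower-cased word of a label to the URNs it describes."""
    index = defaultdict(set)
    for label, urn in records:
        for word in label.lower().split():
            index[word].add(urn)
    return index


def lookup(index, query):
    """Return the URNs whose labels contain every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return sorted(results)


# Hypothetical (label, URN) pairs, e.g. produced by a batch job.
records = [
    ("Example Author", "urn:cts:ns:groupid"),
    ("Example Author First Work", "urn:cts:ns:groupid.workid"),
    ("Example Author Second Work", "urn:cts:ns:groupid.otherid"),
]
index = build_index(records)
print(lookup(index, "author first"))
```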
>
> Alternatively, a third party could also independently create a simple mapping of searchable labels/names to CTS URNs and layer that on top of a CTS. Think of something like the SKOS RDF vocabulary and implement something like
I think I bring a different sort of perspective to the table here. Here, I think, is the difference: as a well-trained and practiced systems analyst I never think just about data formats; I consider *working systems*. To me, RDF, or CTS, or what-have-you is just a way to get data out of a working system of some form and put it into another working system of another form. In fact, unless it's already picked for me, I *never* pick a data format until I have decided the use-cases: what the end-user, or the consuming system, needs to do with the data drives its representation. I would just as well use JSON, or CSV, an object database, a relational database, XHTML, a flat text file, or RDF - or even a mixture of all or some of those technologies (say, a JSON representation coming off a REST-like URL scheme backed by a relational database).
I appreciate that this is not the perspective of Perseus, or the CTS project, and that you're much closer to the perspective of the data and the format of its representation.
I guess I'm just trying to impress upon you the fact that the choices which are made over data formats have real consequences for the way the data can be used. You may think those consequences are trivial; what I am saying is that currently they are not. I don't of course expect anyone to fall over themselves to correct this just for my convenience, but what I am reporting is what I flatly regard as a *bug* in the data specification: no URN field.
Anyway, I appreciate your time. When I have determined what the best way to proceed is, I will let the list know with a post describing my solution, or announcing my available resource.
regards
scot.