JISCMail - DC-RDA Archives

DC-RDA@JISCMAIL.AC.UK

Subject:
Re: datasets for testing rda at scale
From:
Alistair Miles <[log in to unmask]>
Reply-To:
List for discussion on Resource Description and Access (RDA)
Date:
Tue, 17 Feb 2009 12:25:40 +0000
Hi Corey,

Good to hear from you. Yes, I checked out the SIMILE work, although I
haven't studied it in detail. If you scroll down the page at:

http://dublincore.org/dcmirdataskgroup/DataConversion

you'll see a sample record in MARC XML, MODS XML, and SIMILE MODS RDF
format for comparison.

Cheers,

Alistair

On Mon, Feb 16, 2009 at 10:26:09AM -0500, Corey A Harper wrote:
> Hi Alistair,
>
> I think I may have mentioned this to you before, but if not, have you
> seen the early MIT / SIMILE work on MODS->RDF? [1] While I think
> there are a few inaccuracies therein, and it certainly doesn't help at all
> with the RDA/FRBR bits of your analysis, it might still be worth looking
> at, even if only to inform or augment the work you've got going.
>
> I'm really excited to see some of this in action as you continue to make  
> progress.
>
> Thanks,
> -Corey
>
> [1] http://simile.mit.edu/wiki/MARC/MODS_RDFizer
>
> Alistair Miles wrote:
>> Hi Karen,
>>
>> On Fri, Feb 13, 2009 at 06:46:37AM -0800, Karen Coyle wrote:
>>> Alistair,
>>>
>>> I did start an analysis of RDA and MARC, but didn't get very far. I'll
>>> take that up again. What I was mainly finding is that there are a lot of
>>> RDA elements that are listed for more than one MARC element, e.g.
>>>
>>> $a Personal name* = 9.2.2 Preferred Name for the Person*
>>> $b Numeration = *9.2.2 Preferred Name for the Person
>>
>> Yes, I expect there will be lots of issues like this, in both
>> directions. Please do continue your analysis, this type of insight is
>> very useful.
>>
>> I should say that I don't hope to create either a complete or perfect
>> mapping from mods to RDF/RDA/FRBR. Rather I hope to map just enough to
>> capture a significant amount of useful information, to demonstrate the
>> potential for further work in this direction.
>>
>> Cheers,
>>
>> Alistair
>>
>>> There are ones that go the other way, as well, where RDA is more
>>> specific than MARC. It made me wonder how it is that we use the
>>> specific MARC elements: are they needed for display? do they help
>>> input? are they arbitrary?
>>>
>>> I haven't looked at MODS, however, and there isn't a mapping provided
>>> between MODS and RDA. I'll think about that, however.
>>>
>>> kc
>>>
>>> Alistair Miles wrote:
>>>> Hi all,
>>>>
>>>> This is just an update to say that I've converted the LOC/scriblio
>>>> data to marc xml and from there to mods xml. My next step is to do
>>>> some analysis of the loc data in mods xml to get an overview of the
>>>> elements used, then to try to design at least a partial mapping from
>>>> mods xml to RDF using the RDA and FRBR schemas.
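
For the element overview mentioned above, a rough Python sketch of a tag-frequency count over a MODS XML file might look like the following. This is not the script used in the thread; the function name count_mods_elements and the tiny inline sample are hypothetical, and it assumes the file is well-formed XML.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_mods_elements(source):
    """Stream through an XML file and count how often each element name occurs."""
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Tags arrive as '{namespace}localname'; keep only the local name.
        local_name = elem.tag.rsplit("}", 1)[-1]
        counts[local_name] += 1
        # Drop the element's text and children to limit memory use.
        elem.clear()
    return counts


# Hypothetical two-record sample standing in for one of the real splits.
SAMPLE = b"""<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods><titleInfo><title>A</title></titleInfo><name><namePart>X</namePart></name></mods>
  <mods><titleInfo><title>B</title></titleInfo></mods>
</modsCollection>"""

counts = count_mods_elements(io.BytesIO(SAMPLE))
for name, n in counts.most_common():
    print(name, n)
```

Passing a file path instead of the BytesIO object should work the same way, since iterparse accepts either.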
>>>>
>>>> FYI the marc xml and mods xml versions of the LOC/scriblio data can be
>>>> downloaded from the links below...
>>>>
>>>> http://dcmi-rda.s3.amazonaws.com/locdata/part01-marcxml.tar.gz
>>>> http://dcmi-rda.s3.amazonaws.com/locdata/part01-modsxml.tar.gz
>>>> http://dcmi-rda.s3.amazonaws.com/locdata/part02-marcxml.tar.gz
>>>> http://dcmi-rda.s3.amazonaws.com/locdata/part02-modsxml.tar.gz
>>>> [...]
>>>> http://dcmi-rda.s3.amazonaws.com/locdata/part29-marcxml.tar.gz
>>>> http://dcmi-rda.s3.amazonaws.com/locdata/part29-modsxml.tar.gz
>>>>
>>>> Each download is a gzipped tar containing a *set* of up to 25 xml
>>>> files. Each of these files is a 10,000 record split of the data in the
>>>> corresponding part. I broke each part into 10,000 record splits so I
>>>> could process the transformations more easily.
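
A minimal Python sketch of the kind of 10,000-record split described above, assuming the input is a MARCXML collection in the standard MARC21 slim namespace. The thread does not say how the splits were actually produced, so the helper name split_records and the inline sample are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
# Serialize MARC elements with a default namespace rather than an 'ns0:' prefix.
ET.register_namespace("", MARC_NS)


def split_records(source, chunk_size=10000):
    """Yield lists of serialized <record> elements, at most chunk_size per list."""
    chunk = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == f"{{{MARC_NS}}}record":
            chunk.append(ET.tostring(elem, encoding="unicode"))
            # The record is already serialized, so release its contents.
            elem.clear()
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
    if chunk:
        # Final, possibly short, chunk.
        yield chunk


# Hypothetical five-record sample, split into chunks of two.
sample = (
    '<collection xmlns="%s">' % MARC_NS
    + "".join("<record><leader>%05d</leader></record>" % i for i in range(5))
    + "</collection>"
).encode("utf-8")

chunks = list(split_records(io.BytesIO(sample), chunk_size=2))
sizes = [len(c) for c in chunks]
print(sizes)
```

Each yielded chunk would still need to be wrapped in its own collection element before being written out as a standalone split file.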
>>>>
>>>> N.B. there is a bug in part 13 split 25: for some reason the marc xml
>>>> output was incomplete, so up to 10,000 records could be missing.
>>>>
>>>> FWIW I initially tried the conversions without splitting each
>>>> part. I.e. I converted each original marc file into a single marc xml
>>>> file, then tried to transform that to a mods xml file via
>>>> xsltproc. However I found you need more than 7GB ram to do the marcxml
>>>> to modsxml transform on a whole part (I tried it on a large ec2
>>>> instance), so that's when I decided to split each part into smaller
>>>> chunks, which I figured would be faster to process and more amenable
>>>> to parallel processing (transforming all the splits from marcxml to
>>>> modsxml took a couple of hours on a c1.xlarge ec2 instance, running up
>>>> to 10 transformations in parallel; it can also be done on a laptop,
>>>> but takes ~10 times longer).
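
The parallel run described above could be sketched in Python roughly as follows. This is an illustration, not the script used in the thread: the function names, directory layout, and the injectable run parameter are assumptions, and it presumes xsltproc is on the PATH and the stylesheet has been downloaded locally.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def xslt_command(stylesheet, src, dst):
    """Build the xsltproc command line for one split."""
    return ["xsltproc", "--output", str(dst), str(stylesheet), str(src)]


def transform_all(stylesheet, src_dir, dst_dir, max_workers=10, run=subprocess.run):
    """Transform every *.xml file in src_dir, up to max_workers at a time."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    jobs = [
        xslt_command(stylesheet, src, dst_dir / src.name)
        for src in sorted(src_dir.glob("*.xml"))
    ]
    # Threads are enough here: the real work happens in the xsltproc child processes.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() forces completion and re-raises any failure from a worker.
        list(pool.map(lambda cmd: run(cmd, check=True), jobs))
    return jobs
```

A call such as transform_all("MARC21slim2MODS3.xsl", "part01-marcxml", "part01-modsxml") would then process one part; the directory names here are only placeholders.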
>>>>
>>>> Btw if anyone else has experience of the marcxml->modsxml transform on
>>>> a file of similar size do let me know, I don't do a lot of xslt-ing so
>>>> may be missing some tricks for making it work on smaller computers.
>>>>
>>>> Cheers,
>>>>
>>>> Alistair
>>>>
>>>>
>>>> On Mon, Dec 22, 2008 at 03:31:50PM -0500, Ed Summers wrote:
>>>>   
>>>>> Hey Alistair:
>>>>>
>>>>> On Mon, Dec 22, 2008 at 1:16 PM, Alistair Miles
>>>>> <[log in to unmask]> wrote:
>>>>>     
>>>>>> Any tips for how I could turn these data into RDF?
>>>>>>       
>>>>> If you want to work specifically with that dataset you could download
>>>>> the different parts Karen pointed you to, and convert to MARCXML using
>>>>> an efficient tool like yaz-marcdump [1]. yaz-marcdump is nice because
>>>>> it will convert from MARC-8 to UTF-8.
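
As a rough illustration of the yaz-marcdump step, here is a small Python wrapper. The flags reflect my understanding of the YAZ documentation (-f and -t for source and target character sets, -o for the output format) and should be checked against the installed version; the function names and the file name part01.mrc are hypothetical.

```python
import subprocess


def marcdump_command(marc_path):
    """Build a yaz-marcdump call that reads binary MARC and emits MARCXML in UTF-8."""
    return [
        "yaz-marcdump",
        "-f", "MARC-8",   # character set of the input records
        "-t", "UTF-8",    # character set of the output
        "-o", "marcxml",  # output format
        marc_path,
    ]


def marc_to_marcxml(marc_path, xml_path, run=subprocess.run):
    """Run yaz-marcdump and capture its standard output in xml_path."""
    with open(xml_path, "wb") as out:
        run(marcdump_command(marc_path), stdout=out, check=True)


cmd = marcdump_command("part01.mrc")
print(" ".join(cmd))
```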
>>>>>
>>>>> Once you've got it in MARCXML you could then use a stylesheet like
>>>>> LC's [2] to convert to DublinCore flavored RDF. This might be kinda
>>>>> lossy for your RDA work though, so you might want MARCXML->MODS [3],
>>>>> and then use the MODS->RDF conversion that the Simile folks created
>>>>> (which Karen also pointed you to) [4].
>>>>>
>>>>> In fact Simile used that stylesheet on their own MIT Library Catalog
>>>>> MARC data (Barton) and still seem to have the result online [5]. So
>>>>> perhaps just using the Barton data is the quickest way to begin
>>>>> playing with what once was MARC data as RDF? To my knowledge Stefano
>>>>> Mazzocchi simply created an RDF vocabulary that mirrors the  MODS XML
>>>>> Schema, but I haven't looked at it in a while.
>>>>>
>>>>> Another thing worth checking out might be Rob Styles' work [6] with
>>>>> other people at Talis on converting MARC with full fidelity to RDF.
>>>>> Perhaps he has some tools (or data) at his disposal? Rob, you are on
>>>>> here, right?
>>>>>
>>>>> I'd be willing to lend a hand with some of this if necessary, so just
>>>>> let me know if you think I can help.
>>>>>
>>>>> //Ed
>>>>>
>>>>> [1] http://www.indexdata.com/yaz/doc/yaz-marcdump.tkl
>>>>> [2] http://www.loc.gov/standards/marcxml/xslt/MARC21slim2RDFDC.xsl
>>>>> [3] http://www.loc.gov/standards/mods/v3/MARC21slim2MODS3.xsl
>>>>> [4] http://simile.mit.edu/wiki/MARC/MODS_RDFizer
>>>>> [5] http://simile.mit.edu/wiki/Dataset:_Barton
>>>>> [6] http://events.linkeddata.org/ldow2008/papers/02-styles-ayers-semantic-marc.pdf
>>>>>     
>>>>   
>>> -- 
>>> -----------------------------------
>>> Karen Coyle / Digital Library Consultant
>>> [log in to unmask] http://www.kcoyle.net
>>> ph.: 510-540-7596   skype: kcoylenet
>>> fx.: 510-848-3913
>>> mo.: 510-435-8234
>>> ------------------------------------
>>
>
> -- 
> Corey A Harper
> Metadata Services Librarian
> Bobst Library, B42-LL1
> New York University
> 70 Washington Square South
> New York, NY  10012
> 212.998.2479
> [log in to unmask]

-- 
Alistair Miles
Senior Computing Officer
Image Bioinformatics Research Group
Department of Zoology
The Tinbergen Building
University of Oxford
South Parks Road
Oxford
OX1 3PS
United Kingdom
Web: http://purl.org/net/aliman
Email: [log in to unmask]
Tel: +44 (0)1865 281993