JISCMail - MCG Archives

MCG@JISCMAIL.AC.UK

Subject:
Over-structured data (was: Re: How about a Museums-only search engine?)
From:
Eric Baird <[log in to unmask]>
Reply-To:
Museums Computer Group <[log in to unmask]>
Date:
Thu, 8 Sep 2011 03:00:44 +0100
Content-Type:
text/plain
Parts/Attachments:
text/plain (276 lines)
Hi Janet!

Yep, I very strongly agree with you that images need structured
data, but IMO that data probably needs to be embedded directly into the 
graphics file itself, otherwise the title, subject, provenance, owner 
and copyright details are lost as soon as someone exports or downloads 
the image.
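
As a rough illustration of what "embedded in the file itself" can mean, here is a minimal sketch that assumes PNG as the format and uses its tEXt chunks with the predefined keywords Title, Author and Copyright. The function names and the field values are invented for the example, and only the Python standard library is used.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def chunk(kind: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, type, data, CRC-32 of type + data."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def tiny_png_with_text(fields: dict) -> bytes:
    """Return a 1x1 greyscale PNG carrying `fields` as tEXt chunks."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)  # 1x1, 8-bit greyscale
    pixels = zlib.compress(b"\x00\x00")  # filter byte + one black pixel
    text = b"".join(
        chunk(b"tEXt", key.encode("latin-1") + b"\x00" + value.encode("latin-1"))
        for key, value in fields.items()
    )
    return (PNG_SIGNATURE + chunk(b"IHDR", ihdr) + text
            + chunk(b"IDAT", pixels) + chunk(b"IEND", b""))


def read_text_chunks(png: bytes) -> dict:
    """Walk the chunk list and collect every tEXt keyword/value pair."""
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    found, pos = {}, len(PNG_SIGNATURE)
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        kind = png[pos + 4:pos + 8]
        data = png[pos + 8:pos + 8 + length]
        if kind == b"tEXt":
            key, _, value = data.partition(b"\x00")
            found[key.decode("latin-1")] = value.decode("latin-1")
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC
    return found


# Invented example values, not real catalogue data.
FIELDS = {
    "Title": "Tinplate locomotive",
    "Author": "Example Museum",
    "Copyright": "(c) Example Museum",
}
png_bytes = tiny_png_with_text(FIELDS)
print(read_text_chunks(png_bytes))
```

Because the text travels inside the file, a copy of those bytes made on any system still carries the title and copyright, provided the software that handles it doesn't strip the chunks.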

I thought that Microsoft actually had the beginnings of a good 
multimedia structured metadata system with their Win3.1 RIFF file 
format(s), but it was difficult to persuade people to use it, because MS 
themselves didn't take the lead by including simple tools with the OS to 
edit and read the tags. And then the EBU technical wizards went and 
corrupted the thing (again, because there was a lack of reference tools 
that they could use to realise that they were doing it wrong). The lack 
of a proper industry focus on embedded media metadata has been a bit of 
a bugbear of mine ever since.

I didn't notice metadata really impinging on the public consciousness 
until people started building MP3 libraries, and suddenly you had 
members of the general public demanding robust structured metadata 
systems that allowed files to be moved between different computers and 
operating systems while retaining all their attached information, and 
insisting that any MP3 collection software ought to be able to read the 
metadata of any file, transparently, without the user having to do 
anything, and without any data being lost. The software had to work 
intuitively, with decent graphical interfaces that didn't need an IT 
qualification to operate, and users wanted it yesterday. People who 
wrote MP3 organiser programs had to get their acts together because 
users weren't locked in, alternatives were often free, and the software 
products that didn't perform didn't survive. Because the information 
professionals were less demanding and more forgiving than the domestic 
users, the baseline reliability of some systems aimed at the 
"professional" market remained lower than the ones aimed at "amateurs". 
The "home" metadata users got more consistent functionality in their 
products than the "pros" because they had a better idea of what they 
wanted, and they squealed louder when they didn't get it.


I also very much agree with you about the desirability of open-source 
solutions. I want freely-available data to be transferable and
readable, without restriction, on any system, under any software, 
without licensing lock-ins and restrictions, without any loss of 
information. I'm a big fan of structured data, when it's appropriate and 
done properly, and I really like the Freebase initiative (apart from all 
the icky jargon that comes with it).

However, as you say, coming up with appropriate data structures isn't 
always easy. It's an undervalued skill. It needs someone who has an 
understanding of data-structuring, but also someone who has an in-depth 
knowledge of the source material, and in small, specialist museums there 
might well be nobody that has both sets of skills. A museum's expert on 
Seventeenth-Century doll fabrics isn't /necessarily/ going to have a lot
of database-building experience. They might do, but it's not guaranteed.

What you can do in those situations is have the people who understand 
the exhibits write semi-structured entries that include all the key 
information, and once you have your dataset accessible and easily 
browsable, your IT person can read and familiarise themselves with it, 
and do some further structuring. And maybe the two people can talk to 
each other as the project progresses, and learn from each other.
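
A sketch of how that can work in practice, assuming the exhibit expert types a few "Label: value" lines followed by ordinary prose. The entry format, the regular expression and the function name are all made up for illustration; the point is only that the prose is kept alongside the extracted fields rather than thrown away.

```python
import re

# A line counts as a field if it looks like "Label: value".
FIELD_LINE = re.compile(r"^([A-Za-z][A-Za-z ]*):\s*(.+)$")


def split_entry(text: str):
    """Separate 'Label: value' lines from free prose, keeping both."""
    fields, prose = {}, []
    for line in text.strip().splitlines():
        line = line.strip()
        match = FIELD_LINE.match(line)
        if match:
            fields[match.group(1).strip().lower()] = match.group(2).strip()
        elif line:
            prose.append(line)
    return fields, " ".join(prose)


# Invented example entry.
ENTRY = """
Maker: Hornby
Item: Flying Scotsman locomotive
Box is a later replacement. Donor says it was bought second-hand.
"""

fields, notes = split_entry(ENTRY)
print(fields)
print(notes)
```

The IT person can later promote anything that keeps turning up in the notes into a field of its own, once it's clear from real entries what matters.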

However, if there's a rigidly-defined "standards-compliant" procedure 
that says that the best way to save duplicated effort is for the data
to be fully structured from the start, then that can be a really bad 
idea. If you put the task of structuring onto the non-IT-literate 
"exhibit" expert, the result is probably going to be badly designed, and 
if you decide that the "IT" expert is going to set up all the data 
structures at the start of the cataloguing process then at that point, 
the IT person isn't yet familiar with the particular needs of the 
collection, and won't know what's important. Once the data's actually in 
front of both of them, it's easier for them to explain to each other 
their different points of view about what's on the screen, but until 
there's that common reference, it's tricky for each of them to try to 
explain the subtleties of their subject to the other.

Worst-case, your "exhibit" specialist gets taught that the "proper" way 
to catalogue is using just the IT expert's first attempt at a structure, 
and you end up with a catalogue that isn't just badly structured and in 
need of a total overhaul, but actually missing important data that would 
have been present in a more unstructured entry, but got missed because 
there wasn't a structure to hold it. The inputter leaves out the data 
because they think it isn't important, or isn't wanted (because of the 
lack of an input field), and the IT person doesn't see the problem, 
because they never get to see the information that's /not/ being input.


I agree that there are going to be subjects where a high degree of 
structuring is essential ... your "architecture" case is a great 
example, because of the number of recurring cryptic architectural terms
that can mean different things in different contexts. A surname could 
mean a maker, or a manufacturing company, or a style, or a derivative 
style, or a town, or a building name, or architect, or any number of 
other things. Structuring that data might be bloody difficult, but at 
least it's a reasonably familiar problem, due to the amount of academic
work that's already been done on the subject.

But with more obscure subjects, you're not necessarily in a position to 
devise a structure until after you've already collected most of the 
data, and in some cases, the terminology is already so specific that it's
not obvious that "heavy" structuring adds anything. In the museum that 
I'm currently working at, if I want to search the database for "Dinky 
Ford Cortina" or "Hornby Flying Scotsman", I'll be able to find the 
entries regardless of whether the data is structured or not. Beyond a 
certain point, additional structuring of this data doesn't obviously add 
anything useful.

In /this/ collection, because of the fluidity of some of the 
manufacturers' brand names and their historic subcontracting 
arrangements, if I'm searching for an item nominally by a given 
manufacturer, I need to be able to find close matches with that name in 
any other field -- overly-structured searching is counterproductive. If 
an item was assembled by company A in country B, with distinctive parts 
from company C in country D, so that it was sold as an AB but a 
collector recognises it as a CD(A), and it was sold dual-branded in the 
destination market as "made by E for F", or perhaps with a further 
reference to G that at the time referred to a product line but later 
became a brand in its own right, by which time the final company in the 
chain had been sold to someone else ... then any single company name 
that an indexer puts into the manufacturer box is at best only going to 
represent the name that was in largest print on the item's retail box in 
a particular year. For some of these items, even the manufacturers and 
distributors couldn't make up their minds about what they were making, 
where it came from and what it should be called. This stuff gets messy.
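
To make "close matches with that name in any other field" concrete, here is a minimal sketch using Python's standard difflib module. The records, the field names and the 0.8 cutoff are all assumptions chosen for the example, not a description of any real collection system.

```python
import re
from difflib import SequenceMatcher


def tokens(text: str):
    """Lower-case a string and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def close(a: str, b: str, cutoff: float) -> bool:
    """True if two words are similar enough to count as a match."""
    return SequenceMatcher(None, a, b).ratio() >= cutoff


def search_any_field(records, query, cutoff=0.8):
    """Return ids of records where every query word has a close match
    somewhere in the record, whichever field it happens to be in."""
    hits = []
    for record in records:
        words = [w for value in record.values() for w in tokens(str(value))]
        if all(any(close(q, w, cutoff) for w in words) for q in tokens(query)):
            hits.append(record["id"])
    return hits


# Invented records: the name sits in a different field each time.
RECORDS = [
    {"id": 1, "manufacturer": "Meccano Ltd",
     "description": "Dinky Ford Cortina saloon, boxed"},
    {"id": 2, "manufacturer": "Meccano Ltd",
     "description": "Hornby Flying Scotsman locomotive"},
    {"id": 3, "manufacturer": "Unknown",
     "notes": "Box marked 'Hornbey' (sic); loco body only"},
]

print(search_any_field(RECORDS, "hornby"))
print(search_any_field(RECORDS, "dinky cortina"))
```

A search restricted to the manufacturer box would miss both Hornby records here; searching every field with a tolerance for near-spellings finds them.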

For a lot of these items, categorisation needs to be "soft", and that 
can be difficult to do with strict structuring.

What's more important to the person searching (in these cases) is that 
the inputter has added everything that they might want to search by, 
whether it's occurred to the database person to include a box for it or 
not. It's more important to the end-user that the inputter has applied 
common sense and specialist knowledge over what to input, than that 
they've dutifully followed a strict formal procedure. Strict cataloguing 
procedure doesn't always preserve delicate nuances, and sometimes has a 
habit of casting shades of grey into stark black-and-white in an 
inappropriate way. It can corrupt and damage data (unless you have 
"catch-all" fields, which then tend to end up being used for almost 
everything).

-----

So ... XML is cool, and strict categorisation can be great when it's 
appropriate, but strict, formal, centrally-decided XML isn't the answer 
to everything, and a fixation on XML-ling everything can lead to other 
aspects being neglected. Webpage interfacing standards get neglected. 
Search-engine integration gets forgotten about, unless it's an XML 
solution vendor's product. Organisations forget to metatag their images, 
and check that their processing software doesn't strip tags. A sucky 
help system, converted to XML, is likely to still be sucky. Good content 
with bad formatting can always be improved; bad content with wonderful 
formatting is a bit more difficult, because a casual viewer might have 
no way of knowing that it's bad.

We can end up with a focus on imposing ever-stricter and more awkward 
jargon, which operators have to be specially taught how to use, which 
spawns training courses and certificates, and new training courses when 
the system changes, and licensing restrictions to sustain all that 
support infrastructure. It becomes more difficult for casual volunteers 
to use the system, the results are less suitable for presenting to the 
general public, and it costs museums more to train their staff. So we 
make stuff more and more complicated and technical instead of developing 
smart interpretational software that looks for patterns in the data, and 
makes suggestions. Instead of training computers to respond more like 
people, we train people to think more like computers. Instead of 
developing more sophisticated interfaces, we keep them looking like old 
programmer's software development systems from the 1990s, because it's 
easier to make money selling service contracts for complex systems than 
by making things work sufficiently well in the first place that people 
don't need outside help.

So, yes, in general, I agree that structuring is often a good thing (and 
sometimes essential), and centrally-decided standards are also often 
useful ... for instance, it's best if an embedded copyright field has a 
standard identifier that everyone can recognise and read, rather than 
everyone coming up with their own unique methods of tagging copyright 
data. But in other cases, it's best if the decisions about how to 
structure data and how strongly to structure it are made locally, by the 
people actually at the sharp end. Imposing strict nationally-decided 
standards onto a museum in an attempt to guarantee the quality of their 
cataloguing process isn't always helpful, and if the purpose of 
standardisation is to help the small, specialist Museum swap data with 
other similar museums, and the only similar museums are in other 
countries where those national standards aren't going to be used, then 
it might be quite difficult for the small Museum to work out exactly 
what the point is of having those national standards, if they're just 
creating additional national barriers between a Museum and the foreign 
specialist datasources that they might want to access.

----

The example that always springs to mind for me as a textbook failure of 
the "hard" cataloguing approach was the attempt to unify the UK and US 
bibliographic cataloguing systems. For years, apparently, the two sides 
were in a slightly Swiftian deadlock because they couldn't agree on the 
correct spelling of the word "catalogue". Both sides agreed that there 
/was/ a correct spelling, but that it was theirs. To me, that's the 
result of trying to force the data to fit an artificially-imposed 
"official" system, and the approach that would have been more sensitive 
to the underlying data would have been to accept both spellings, and 
maybe let each local group use whichever one it preferred.

The contrasting success of the "soft" cataloguing approach was when the 
International Committee of the Red Cross changed their official name to
ICRC. Their problem was that while the Red Cross name and logo 
symbolised medical aid and help in Northern Europe and the US, in the 
Middle East it was the emblem of invading Christian knights during the 
Crusades. So the ICRC is known as the "Red Cross" over here, and over 
there it's known as the "Red Crescent", and the single official name is
just "ICRC", which forks into two local "known as" names and logos. The 
last two letters of ICRC don't have a fixed meaning, because the ICRC 
were smart enough to understand that they didn't need to have one. As 
long as the organisation had a fixed agreed name (which in this case was 
four letters) those letters didn't have to officially stand for 
anything. It was radical, but there was no technical reason why they 
couldn't do it.

The ICRC were bright enough to understand which aspects of naming 
systems were required and which were merely historical convention, 
whereas the people who catalogued stuff professionally were too hung up 
on fixed single answers and taught standard spellings to be able to 
accept a flexible approach.

If we develop cataloguing systems that are "soft", and automatically 
deal with different terminology dialects (as well as US/UK spellings), 
then we'll have a system that'll not only let us update "awkward" legacy 
terminologies and migrate to more useful versions, cope with 
international spellings, and make connections between databases that 
have been built using different schemes, we'll also have the beginnings 
of a system that might be eventually able to cope with comparing 
databases built in different languages and maybe even different scripts. 
Those things would take a lot of work, but coping intelligently with 
dialects would be a first step.
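
The simplest form of that first step can be sketched as a lookup table: every spelling points at one neutral concept identifier, and each locale keeps its own preferred label. The identifiers, locale codes and function names below are hypothetical, made up for the example.

```python
from typing import Optional

# One neutral identifier per concept, with a preferred label per locale.
CONCEPTS = {
    "concept:catalogue": {"en-GB": "catalogue", "en-US": "catalog"},
    "concept:colour": {"en-GB": "colour", "en-US": "color"},
}

# Reverse index: any known spelling leads back to its concept.
LOOKUP = {
    label: concept_id
    for concept_id, labels in CONCEPTS.items()
    for label in labels.values()
}


def concept_for(term: str) -> Optional[str]:
    """Find the concept behind a term, whichever spelling was used."""
    return LOOKUP.get(term.strip().lower())


def label_for(concept_id: str, locale: str) -> str:
    """Show a concept using the spelling the local group prefers."""
    return CONCEPTS[concept_id][locale]


def same_concept(a: str, b: str) -> bool:
    """True if two terms are just different spellings of one concept."""
    concept = concept_for(a)
    return concept is not None and concept == concept_for(b)


print(same_concept("Catalogue", "catalog"))
print(label_for(concept_for("color"), "en-GB"))
```

Neither spelling has to win. Records are matched on the identifier, and each group carries on typing and seeing the form it's used to.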

On the other hand, if we adopt entrenched hard-coded national standards 
for terminology, and our answer to the resulting incompatibilities is to 
say that the different local terminologies should fight it out until 
there's one national winner ... then we're just putting a wall around
the UK and pretending that the outside world doesn't exist, and deciding 
that UK museums won't want to have anything to do with the Smithsonian, 
and UK galleries don't need to have anything to do with the Louvre. It 
means that we're not actually learning anything about how to connect 
systems across "soft" interfaces, and that when the smarter systems 
start turning up (probably with the help of EU development funding), 
they probably won't be developed in the UK, and if there are any changes 
that we'd like made to make those systems friendlier to UK 
organisations, then tough, because we won't have had a hand in the 
development.

And our local software businesses will struggle to even be contractors, 
because they won't understand how the things work.


Which is mildly depressing.

Eric Baird

On 06/09/2011 09:10, J DAVIS wrote:
> Interesting ideas, Eric.
>
> What you don't see is what the search engines don't find - and unlike me, few people will wade through 50 or more pages to find the result they want.
> I search across collections a lot - not just museums' collections but cultural (and sometimes natural) heritage collections looked after in archives (including historic environment records), libraries, historic houses, galleries, heritage sites etc.
>
> After a research project I worked on over a decade ago that looked at searching museum image databases, I became firmly convinced that as more collections information (and images) went online, the more difficult it would become to search for specific things unless descriptions were structured and used controlled vocabulary. I am seeing that in today's Web environment. I think that we also should be using open source and sustainable technology as much as possible.
>
> I also know from experience that it is time-consuming and requires some knowledge and experience to describe things in a structured way. When I've been involved in developing or proposing redevelopment of a new system for a collection, I try to make it easier for people to describe things in a way that puts them in the right semantic context, even if they don't know the exact word for the object/type of site etc.
>
> Of course, there are also masses of records that are described perfectly well by experts in their own narrow context but make no sense when released into the wild, unedited (I usually quote listed buildings descriptions - I studied architectural history at degree level and worked for English Heritage for 11 years, and still find unexpurgated listed buildings descriptions semantically indigestible).
>
> Best wishes,
> Janet
>
> Janet E Davis
>

****************************************************************
       website:  http://museumscomputergroup.org.uk/
       Twitter:  http://www.twitter.com/ukmcg
      Facebook:  http://www.facebook.com/museumscomputergroup
 [un]subscribe:  http://museumscomputergroup.org.uk/email-list/
****************************************************************