Yep, I very strongly agree with you that images need structured
data, but IMO that data probably needs to be embedded directly into the
graphics file itself, otherwise the title, subject, provenance, owner
and copyright details are lost as soon as someone exports or downloads
the file.
I thought that Microsoft actually had the beginnings of a good
multimedia structured metadata system with their Win3.1 RIFF file
format(s), but it was difficult to persuade people to use it, because MS
themselves didn't take the lead by including simple tools with the OS to
edit and read the tags. And then the EBU technical wizards went and
corrupted the thing (again, because there was a lack of reference tools
that they could use to realise that they were doing it wrong). The lack
of a proper industry focus on embedded media metadata has been a bit of
a bugbear of mine ever since.
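For anyone who hasn't looked inside one: a RIFF file is a run of chunks, each with a four-character identifier and a little-endian length, and the descriptive tags live in a LIST chunk of type INFO (INAM for the title, ICOP for copyright, and so on). Here is a minimal Python sketch of a tag reader. It assumes the whole file fits in memory and that the tag text is Latin-1, and the function names are my own rather than part of any library.

```python
import struct

def riff_info_tags(data):
    """Return {tag_id: text} from the LIST/INFO chunk of a RIFF byte string."""
    if data[:4] != b"RIFF":
        raise ValueError("not a RIFF file")
    tags = {}
    pos = 12  # skip "RIFF", the 4-byte size and the 4-byte form type
    while pos + 8 <= len(data):
        chunk_id, size = struct.unpack_from("<4sI", data, pos)
        body = data[pos + 8 : pos + 8 + size]
        if chunk_id == b"LIST" and body[:4] == b"INFO":
            sub = 4  # skip the "INFO" list type
            while sub + 8 <= len(body):
                tag, tag_size = struct.unpack_from("<4sI", body, sub)
                text = body[sub + 8 : sub + 8 + tag_size]
                # Assumption: tag text is NUL-terminated Latin-1.
                tags[tag.decode("ascii")] = text.rstrip(b"\x00").decode("latin-1")
                sub += 8 + tag_size + (tag_size & 1)  # chunks are padded to even length
        pos += 8 + size + (size & 1)
    return tags

def _chunk(chunk_id, payload):
    """Build one chunk for the demo below, padded to an even length."""
    pad = b"\x00" if len(payload) % 2 else b""
    return chunk_id + struct.pack("<I", len(payload)) + payload + pad

# Made-up sample: a RIFF container holding nothing but two INFO tags
# (not a playable file, just enough structure to exercise the reader).
info = b"INFO" + _chunk(b"INAM", b"Test title\x00") + _chunk(b"ICOP", b"(c) Example\x00")
listing = _chunk(b"LIST", info)
sample = b"RIFF" + struct.pack("<I", 4 + len(listing)) + b"WAVE" + listing

print(riff_info_tags(sample))
```

The point is how little code a reference reader needs, which is why the absence of simple bundled tools was so frustrating.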
I didn't notice metadata really impinging on the public consciousness
until people started building MP3 libraries, and suddenly you had
members of the general public demanding robust structured metadata
systems that allowed files to be moved between different computers and
operating systems while retaining all their attached information, and
insisting that any MP3 collection software ought to be able to read the
metadata of any file, transparently, without the user having to do
anything, and without any data being lost. The software had to work
intuitively, with decent graphical interfaces that didn't need an IT
qualification to operate, and users wanted it yesterday. People who
wrote MP3 organiser programs had to get their acts together because
users weren't locked in, alternatives were often free, and the software
products that didn't perform didn't survive. Because the information
professionals were less demanding and more forgiving than the domestic
users, the baseline reliability of some systems aimed at the
"professional" market remained lower than the ones aimed at "amateurs".
The "home" metadata users got more consistent functionality in their
products than the "pros" because they had a better idea of what they
wanted, and they squealed louder when they didn't get it.
I also very much agree with you about the desirability of open-source
solutions. I want freely-available data to be transferable and
readable, without restriction, on any system, under any software,
without licensing lock-ins and restrictions, without any loss of
information. I'm a big fan of structured data, when it's appropriate and
done properly, and I really like the Freebase initiative (apart from all
the icky jargon that comes with it).
However, as you say, coming up with appropriate data structures isn't
always easy. It's an undervalued skill. It needs someone who has an
understanding of data-structuring, but also someone who has an in-depth
knowledge of the source material, and in small, specialist museums there
might well be nobody that has both sets of skills. A museum's expert on
Seventeenth-Century doll fabrics isn't /necessarily/ going to have a lot
of database-building experience. They might do, but it's not guaranteed.
What you can do in those situations is have the people who understand
the exhibits write semi-structured entries that include all the key
information, and once you have your dataset accessible and easily
browsable, your IT person can read and familiarise themselves with it,
and do some further structuring. And maybe the two people can talk to
each other as the project progresses, and learn from each other.
However, if there's a rigidly-defined "standards-compliant" procedure
that says that the best way to save duplicated effort is for the data
to be fully structured from the start, then that can be a really bad
idea. If you put the task of structuring onto the non-IT-literate
"exhibit" expert, the result is probably going to be badly designed, and
if you decide that the "IT" expert is going to set up all the data
structures at the start of the cataloguing process then at that point,
the IT person isn't yet familiar with the particular needs of the
collection, and won't know what's important. Once the data's actually in
front of both of them, it's easier for them to explain to each other
their different points of view about what's on the screen, but until
there's that common reference, it's tricky for each of them to try to
explain the subtleties of their subject to the other.
Worst-case, your "exhibit" specialist gets taught that the "proper" way
to catalogue is using just the IT expert's first attempt at a structure,
and you end up with a catalogue that isn't just badly structured and in
need of a total overhaul, but actually missing important data that would
have been present in a more unstructured entry, but got missed because
there wasn't a structure to hold it. The inputter leaves out the data
because they think it isn't important, or isn't wanted (because of the
lack of an input field), and the IT person doesn't see the problem,
because they never get to see the information that's /not/ being input.
I agree that there are going to be subjects where a high degree of
structuring is essential ... your "architecture" case is a great
example because of the number of recurring cryptic architectural terms
that can mean different things in different contexts. A surname could
mean a maker, or a manufacturing company, or a style, or a derivative
style, or a town, or a building name, or architect, or any number of
other things. Structuring that data might be bloody difficult, but at
least it's a reasonably familiar problem, due to the amount of academic
work that's already been done on the subject.
But with more obscure subjects, you're not necessarily in a position to
devise a structure until after you've already collected most of the
data, and in some cases, the terminology is already so specific that it's
not obvious that "heavy" structuring adds anything. In the museum that
I'm currently working at, if I want to search the database for "Dinky
Ford Cortina" or "Hornby Flying Scotsman", I'll be able to find the
entries regardless of whether the data is structured or not. Beyond a
certain point, additional structuring of this data doesn't obviously add
anything.
In /this/ collection, because of the fluidity of some of the
manufacturers' brand names and their historic subcontracting
arrangements, if I'm searching for an item nominally by a given
manufacturer, I need to be able to find close matches with that name in
any other field -- overly-structured searching is counterproductive. If
an item was assembled by company A in country B, with distinctive parts
from company C in country D, so that it was sold as an AB but a
collector recognises it as a CD(A), and it was sold dual-branded in the
destination market as "made by E for F", or perhaps with a further
reference to G that at the time referred to a product line but later
became a brand in its own right, by which time the final company in the
chain had been sold to someone else ... then any single company name
that an indexer puts into the manufacturer box is at best only going to
represent the name that was in largest print on the item's retail box in
a particular year. For some of these items, even the manufacturers and
distributors couldn't make up their minds about what they were making,
where it came from and what it should be called. This stuff gets messy.
For a lot of these items, categorisation needs to be "soft", and that
can be difficult to do with strict structuring.
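As a rough illustration of what I mean by "soft" searching, here is a minimal Python sketch that looks for close matches to a name in every field of a record, rather than only in the manufacturer box. The record layout, the field names, the sample entries and the 0.8 threshold are all assumptions made up for the example, not a description of any real collections system.

```python
from difflib import SequenceMatcher

def soft_search(records, query, threshold=0.8):
    """Return (score, record) pairs where any word in any field loosely matches."""
    query = query.lower()
    hits = []
    for record in records:
        best = 0.0
        for value in record.values():
            for word in str(value).lower().split():
                best = max(best, SequenceMatcher(None, query, word).ratio())
        if best >= threshold:
            hits.append((best, record))
    # Best matches first; sort on the score only, since dicts can't be compared.
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return hits

# Hypothetical sample records, invented for illustration.
records = [
    {"id": 1, "manufacturer": "Meccano Ltd",
     "description": "Dinky Ford Cortina die-cast car"},
    {"id": 2, "manufacturer": "Hornby",
     "description": "Flying Scotsman locomotive"},
    {"id": 3, "manufacturer": "Meccano Ltd",
     "description": "Clockwork train set",
     "notes": "sold under the Hornby name"},
]

for score, record in soft_search(records, "hornby"):
    print(round(score, 2), record["id"])
```

A search for "hornby" finds record 3 through its notes field even though the manufacturer box says something else, and a slightly misspelt query such as "Dinkey" still finds record 1. A search restricted to the manufacturer field would miss both.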
What's more important to the person searching (in these cases) is that
the inputter has added everything that they might want to search by,
whether it's occurred to the database person to include a box for it or
not. It's more important to the end-user that the inputter has applied
common sense and specialist knowledge over what to input, than that
they've dutifully followed a strict formal procedure. Strict cataloguing
procedure doesn't always preserve delicate nuances, and sometimes has a
habit of casting shades of grey into stark black-and-white in an
inappropriate way. It can corrupt and damage data (unless you have
"catch-all" fields, which then tend to end up being used for almost
everything).
So ... XML is cool, and strict categorisation can be great when it's
appropriate, but strict, formal, centrally-decided XML isn't the answer
to everything, and a fixation on XML-ling everything can lead to other
aspects being neglected. Webpage interfacing standards get neglected.
Search-engine integration gets forgotten about, unless it's an XML
solution vendor's product. Organisations forget to metatag their images,
and check that their processing software doesn't strip tags. A sucky
help system, converted to XML, is likely to still be sucky. Good content
with bad formatting can always be improved; bad content with wonderful
formatting is a bit more difficult, because a casual viewer might have
no way of knowing that it's bad.
We can end up with a focus on imposing ever-stricter and more awkward
jargon, which operators have to be specially taught how to use, which
spawns training courses and certificates, and new training courses when
the system changes, and licensing restrictions to support all the
support infrastructure. It becomes more difficult for casual volunteers
to use the system, the results are less suitable for presenting to the
general public, and it costs museums more to train their staff. So we
make stuff more and more complicated and technical instead of developing
smart interpretational software that looks for patterns in the data, and
makes suggestions. Instead of training computers to respond more like
people, we train people to think more like computers. Instead of
developing more sophisticated interfaces, we keep them looking like old
programmer's software development systems from the 1990s, because it's
easier to make money selling service contracts for complex systems than
by making things work sufficiently well in the first place that people
don't need outside help.
So, yes, in general, I agree that structuring is often a good thing (and
sometimes essential), and centrally-decided standards are also often
useful ... for instance, it's best if an embedded copyright field has a
standard identifier that everyone can recognise and read, rather than
everyone coming up with their own unique methods of tagging copyright
data. But in other cases, it's best if the decisions about how to
structure data and how strongly to structure it are made locally, by the
people actually at the sharp end. Imposing strict nationally-decided
standards onto a museum in an attempt to guarantee the quality of their
cataloguing process isn't always helpful, and if the purpose of
standardisation is to help the small, specialist museum swap data with
other similar museums, and the only similar museums are in other
countries where those national standards aren't going to be used, then
it might be quite difficult for the small museum to work out exactly
what the point is of having those national standards, if they're just
creating additional national barriers between a museum and the foreign
specialist datasources that they might want to access.
The example that always springs to mind for me as a textbook failure of
the "hard" cataloguing approach was the attempt to unify the UK and US
bibliographic cataloguing systems. For years, apparently, the two sides
were in a slightly Swiftian deadlock because they couldn't agree on the
correct spelling of the word "catalogue". Both sides agreed that there
/was/ a correct spelling, but that it was theirs. To me, that's the
result of trying to force the data to fit an artificially-imposed
"official" system, and the approach that would have been more sensitive
to the underlying data would have been to accept both spellings, and
maybe use locally whichever one the local group preferred.
The contrasting success of the "soft" cataloguing approach was when the
International Committee of the Red Cross changed their official name to
ICRC. Their problem was that while the Red Cross name and logo
symbolised medical aid and help in Northern Europe and the US, in the
Middle East it was the emblem of invading Christian knights during the
Crusades. So the ICRC is known as the "Red Cross" over here, and over
there it's known as the "Red Crescent", and the single official name is
just "ICRC", which forks into two local "known as" names and logos. The
last two letters of ICRC don't have a fixed meaning, because the ICRC
were smart enough to understand that they didn't need to have one. As
long as the organisation had a fixed agreed name (which in this case was
four letters) those letters didn't have to officially stand for
anything. It was radical, but there was no technical reason why they
couldn't do it.
The ICRC were bright enough to understand which aspects of naming
systems were required and which were merely historical convention,
whereas the people who catalogued stuff professionally were too hung up
on fixed single answers and taught standard spellings to be able to
accept a flexible approach.
If we develop cataloguing systems that are "soft", and automatically
deal with different terminology dialects (as well as US/UK spellings),
then we'll have a system that'll not only let us update "awkward" legacy
terminologies and migrate to more useful versions, cope with
international spellings, and make connections between databases that
have been built using different schemes, we'll also have the beginnings
of a system that might be eventually able to cope with comparing
databases built in different languages and maybe even different scripts.
Those things would take a lot of work, but coping intelligently with
dialects would be a first step.
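To make the "accept both spellings" idea concrete, here is a small Python sketch, with everything in it assumed for illustration: the variant table is hypothetical and tiny, and the internal key is just an arbitrary label for matching, not a claim about which spelling is "correct".

```python
# Hypothetical variant table: each known variant maps to one internal key.
# The key is only used for matching and is never imposed on anyone.
VARIANTS = {
    "catalog": "catalogue",
    "color": "colour",
    "aluminum": "aluminium",
}

def canonical(term):
    """Fold a term to its internal key (lower-cased, variants merged)."""
    word = term.strip().lower()
    return VARIANTS.get(word, word)

def same_term(a, b):
    """True if two terms are the same word in different spelling dialects."""
    return canonical(a) == canonical(b)

def display(term, preferred):
    """Show a term in the local group's preferred spelling, if they have one."""
    return preferred.get(canonical(term), term)

# Each site keeps its own preferences, keyed by the shared internal key.
US_PREFS = {"catalogue": "catalog", "colour": "color"}
UK_PREFS = {}

print(same_term("Catalog", "catalogue"))
print(display("catalogue", US_PREFS))
print(display("catalogue", UK_PREFS))
```

Both sides search and match against the same key, and each side sees its own spelling on screen, so nobody has to win the argument.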
On the other hand, if we adopt entrenched hard-coded national standards
for terminology, and our answer to the resulting incompatibilities is to
say that the different local terminologies should fight it out until
there's one national winner ... then we're just putting a wall around
the UK and pretending that the outside world doesn't exist, and deciding
that UK museums won't want to have anything to do with the Smithsonian,
and UK galleries don't need to have anything to do with the Louvre. It
means that we're not actually learning anything about how to connect
systems across "soft" interfaces, and that when the smarter systems
start turning up (probably with the help of EU development funding),
they probably won't be developed in the UK, and if there are any changes
that we'd like made to make those systems friendlier to UK
organisations, then tough, because we won't have had a hand in the
design.
And our local software businesses will struggle to even be contractors,
because they won't understand how the things work.
Which is mildly depressing.
On 06/09/2011 09:10, J DAVIS wrote:
> Interesting ideas, Eric.
> What you don't see is what the search engines don't find - and unlike me, few people will wade through 50 or more pages to find the result they want.
> I search across collections a lot - not just museums' collections but cultural (and sometimes natural) heritage collections looked after in archives (including historic environment records), libraries, historic houses, galleries, heritage sites etc.
> After a research project I worked on over a decade ago that looked at searching museum image databsaes, I became firmly convinced that as more collections information (and images) went online, the more difficult it would become to search for specific things unless descriptions were structured and used controlled vocabulary. I am seeing that in today's Web environment. I think that we also should be using open source and sustainable technology as much as possible.
> I also know from experience that it is time-consuming and requires some knowledge and experience to describe things in a structured way. When I've been involved in developing or proposing redevelopment of a new system for a collection, I try to make it easier for people to describe things in a way that puts them in the right semantic context, even if they don't know the exact word for the object/type of site etc.
> Of course, there are also masses of records that are described perfectly well by experts in their own narrow context but make no sense when released into the wild, unedited (I usually quote listed buildings descriptions - I studied architectural history at degree level and worked for English Heritage for 11 years, and still find unexpurgated listed buildings descriptions semantically indigestible).
> Best wishes,
> Janet E Davis