Thanks Ben (and others who have replied)
We're just working on this and trying a few things. I'll aim to report back on our 'solution'.
Cheers, James
-----Original Message-----
From: Museums Computer Group [mailto:[log in to unmask]] On Behalf Of Ben Rubinstein
Sent: 02 August 2016 16:22
To: [log in to unmask]
Subject: Re: Accented and special characters in collections search
Just one more update on this - a colleague has pointed out that Solr includes
various filters (as part of the standard distribution, but not enabled by
default) which might also help. So for example the Soundex filter would
probably match Törnquist, Toernquist, Turnquist and, for that matter, I should
think Turnkwist!
Ben
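
To illustrate the Soundex idea, here is a minimal Python sketch of the classic four-character Soundex code. It is not Solr's implementation, and the accent-stripping step (fold_accents, a hypothetical helper) is an assumption added so that ö is treated as o before encoding.

```python
import unicodedata

# Standard Soundex digit groups: letters in the same group share a digit.
_CODES = {}
for digit, letters in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for letter in letters:
        _CODES[letter] = str(digit)


def fold_accents(text):
    """Strip combining marks, so 'ö' becomes 'o' (assumption, not part of Soundex)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def soundex(name):
    """Return the four-character Soundex code for a name, or '' if it has no letters."""
    letters = [ch for ch in fold_accents(name).upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != previous:
            code += digit
        # H and W do not separate letters with the same code; vowels do.
        if ch not in "HW":
            previous = digit
    # Pad with zeros and keep the first letter plus three digits.
    return (code + "000")[:4]


for variant in ["Törnquist", "Toernquist", "Turnquist", "Turnkwist"]:
    print(variant, soundex(variant))
```

Under these assumptions all four spellings get the same code, which is why a phonetic filter can treat them as one term. The trade-off is that unrelated names with similar consonants will also collide.
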
On 26/07/2016 14:01, Ben Rubinstein wrote:
> On 26/07/2016 13:09, Howe, Michael P.A. wrote:
>> However, technically Törnquist is transliterated to Toernquist - and this
>> wouldn't be picked up.
>
> This is where the synonyms file comes in (at least using Solr as James is, but
> there are similar systems elsewhere). So for example in the site we built for
> the Yiddish Book Centre we have a synonyms file with about 1600 lines such as
>
> gurshtayn,gurshtein,gurshteyn,gurstein,gurštejn,גורשטיינ
> schwartz,shvarts,shvartz,šwartz,šwarz,шварц,שווארצ,שוורצ,שורצ
>
> - of course it's a lot of work to build this up, but you can do the most
> common, obvious or important ones, and then do regular reviews from the logs
> of searches (and especially failed searches).
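
As a rough illustration of how such a synonyms file behaves, the Python sketch below treats each comma-separated line as a group of equivalent terms and expands a search term to every variant in its group. The function names (load_synonyms, expand) are hypothetical and this is not how Solr implements it internally; it only shows the effect.

```python
SYNONYM_LINES = [
    "gurshtayn,gurshtein,gurshteyn,gurstein,gurštejn,גורשטיינ",
    "schwartz,shvarts,shvartz,šwartz,šwarz,шварц,שווארצ,שוורצ,שורצ",
]


def load_synonyms(lines):
    """Map every variant to the full set of variants listed on its line."""
    lookup = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        group = frozenset(t.strip().lower() for t in line.split(",") if t.strip())
        for term in group:
            # Merge with any group this term already belongs to.
            lookup[term] = lookup.get(term, frozenset()) | group
    return lookup


def expand(term, lookup):
    """Return all equivalent spellings of a term, or just the term itself."""
    key = term.strip().lower()
    return sorted(lookup.get(key, {key}))


LOOKUP = load_synonyms(SYNONYM_LINES)
print(expand("Shvarts", LOOKUP))
```

A search for any one spelling can then be run against all of them, which is what makes the log-review approach worthwhile: each failed search that gets added to the file fixes every variant at once.
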
>
> Additionally (and more simply because there's a more limited set of cases) you
> can use the mapping file to expand a single character to multiple ones, e.g.
> # ö => o
> "\u00F6" => "o"
>
> but also
> # œ => oe
> "\u0153" => "oe"
>
> I think, but I'm not sure, that you can also have multiple mappings for the
> same source character.
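
The effect of those mapping rules can be sketched in Python as below. The parser is a simplified assumption that handles only single-character sources written in the "\uXXXX" => "..." form shown above; it is not Solr's own parser. In this sketch a later rule for the same source character simply replaces the earlier one, which says nothing about how Solr treats that case.

```python
import re

# Raw strings keep the \uXXXX escapes literal, as they appear in the file.
MAPPING_LINES = [
    r'# ö => o',
    r'"\u00F6" => "o"',
    r'# œ => oe',
    r'"\u0153" => "oe"',
]

RULE = re.compile(r'^"([^"]+)"\s*=>\s*"([^"]*)"$')
ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def unescape(text):
    """Turn literal \\uXXXX sequences into the characters they name."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


def parse_mapping(lines):
    """Build a source-to-target dict; comments and blank lines are skipped."""
    mapping = {}
    for line in lines:
        match = RULE.match(line.strip())
        if match:
            mapping[unescape(match.group(1))] = unescape(match.group(2))
    return mapping


def apply_mapping(text, mapping):
    """Replace each mapped character; one character may expand to several."""
    return "".join(mapping.get(ch, ch) for ch in text)


MAPPING = parse_mapping(MAPPING_LINES)
print(apply_mapping("Törnquist", MAPPING))
print(apply_mapping("œuvre", MAPPING))
```

The same mapping has to be applied to both the indexed text and the query, otherwise a search typed with ö would no longer match the stored o.
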
>
> Ben
>
> On 26/07/2016 13:09, Howe, Michael P.A. wrote:
>> Hi Everyone,
>>
>> The treatment of diacriticals is also a big problem in bibliographies, where
>> you don't really have the option of replacing them with Anglicised terms -
>> but you can include a translation, possibly in [square brackets]. Programs
>> such as Endnote will store the diacriticals, but will also return the words
>> if you search without the diacriticals: e.g. Törnquist, S.L. will be
>> returned if you search for Törnquist or Tornquist. I would recommend
>> configuring your search to work in this manner.
>>
>> However, technically Törnquist is transliterated to Toernquist - and this
>> wouldn't be picked up.
>>
>> It then gets even more complicated - Norwegian, for example, uses ø instead
>> of ö and æ instead of ä and they treat these as extra letters following z.
>> They also have a third additional letter å. Just when you've got used to
>> finding these at the end of the dictionary, you could find a Danish double
>> Aa, which is the same as å and also comes at the end of the dictionary.
>>
>> My advice: Enter the object name exactly as it is spelt in its original
>> language, add an English translation and transliteration in [square
>> brackets], and configure your search engine to ignore diacriticals (if you
>> can).
>>
>> Good luck!
>> Mike
>>
>> Dr Mike Howe
>> Chief Curator
>> Head of the National Geological Repository
>>
>> Phone: 0115 9363105 Email: [log in to unmask]
>> Web: http://www.bgs.ac.uk/staff/profiles/3858.html
>> WSB UGN - British Geological Survey
>> Keyworth, Nottingham, NG12 5GG
>>
>> -----Original Message-----
>> From: Museums Computer Group [mailto:[log in to unmask]] On Behalf Of Robin
>> Patel
>> Sent: 26 July 2016 12:28
>> To: [log in to unmask]
>> Subject: Re: [MCG] Accented and special characters in collections search
>>
>> Hi James,
>>
>> My knowledge is somewhat simple in this field, but would it not be easier to
>> search for all object names that use special characters and replace
>> them with Anglicised terms? Perhaps the 'correct' term could be
>> stored under a 'related' terms field? This is similar to using equivalent
>> names for objects in different languages e.g. the Gaelic name for an object.
>>
>> Am I correct in assuming from a usability point of view, it's highly
>> unlikely that people would search using special characters? Knowing how to
>> input special characters when typing is a challenge in itself!
>>
>> Robin
>>
>> --
>> Robin Patel
>> Ergadia Museums & Heritage
>> t: 01786 860 691
>> m: 07815 312 562
>> [log in to unmask]
>> https://ergadiaheritage.com/
>>
>>
>>
>>
>>
>> On 26 July 2016 at 10:07, James Morley <[log in to unmask]> wrote:
>>
>>> Hi all
>>>
>>> We were pondering an issue last night with accented and special
>>> characters in collections search, and wondered if anyone had examples of
>>> best practice?
>>>
>>> Currently at IWM we treat them uniquely, so a search for cafe gives you
>>> 361 results, and a search for café 200 results. There's only an
>>> overlap of about ten results which have both variants, so about 550
>>> combined. Even more pronounced is aéroplanes (1 result) and aeroplanes
>>> (4900 results).
>>>
>>> We're thinking of indexing against both accented and non-accented
>>> forms, to ensure something with café also gets indexed for cafe - in
>>> other words merging the results. My one concern then is that the user
>>> loses granularity and there could be specific examples where quite a
>>> precise term gets lost in something more generic (though I can't think
>>> of a specific example right now). From a technology point of view
>>> it's all based on Solr, so a thought was to somehow push up relevancy
>>> ranking for the accented/special character matches.
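
One way to picture the "index both forms, but rank exact matches higher" idea is the small in-memory Python sketch below. TinyIndex, fold and the exact_boost value are all hypothetical, and the scoring is far cruder than Solr's relevancy ranking; it only shows merged results with the accented match floated to the top.

```python
import unicodedata
from collections import defaultdict


def fold(text):
    """Strip combining marks, so 'café' becomes 'cafe'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalise(text):
    """Lower-case and compose, so equivalent accented forms compare equal."""
    return unicodedata.normalize("NFC", text.lower())


class TinyIndex:
    def __init__(self):
        self.exact = defaultdict(set)   # token as catalogued
        self.folded = defaultdict(set)  # token with accents removed

    def add(self, doc_id, text):
        for token in normalise(text).split():
            self.exact[token].add(doc_id)
            self.folded[fold(token)].add(doc_id)

    def search(self, term, exact_boost=1.0):
        term = normalise(term)
        # Every record matching the folded form is a hit...
        scores = {doc_id: 1.0 for doc_id in self.folded.get(fold(term), ())}
        # ...and records matching the term exactly as typed score higher.
        for doc_id in self.exact.get(term, ()):
            scores[doc_id] += exact_boost
        return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


index = TinyIndex()
index.add(1, "Café de Paris")
index.add(2, "cafe interior")
print(index.search("café"))
print(index.search("cafe"))
```

Both searches return both records, so nothing is lost by merging, but the order differs depending on what the user typed. Note that this kind of folding only removes combining marks: letters such as ø and æ have no decomposition and would still need an explicit mapping.
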
>>>
>>> It's interesting to look at search stats and see that people are quite
>>> extensively using accents and special characters, especially for
>>> people and place names (and a few for aeroplanes, who must have been
>>> quite disappointed!). Also, because of the different collections areas
>>> and historic cataloguing, we seem to have a mix of accurate and 'Anglicised'
>>> names in our collections data!
>>>
>>> Cheers
>>>
>>> James
>>>
>>>
>>> James Morley
>>> Data Developer
>>>
>>> Imperial War Museums
>>> Lambeth Road
>>> London SE1 6HZ
>>>
>>> [log in to unmask]
>>> 07713 360563
>>> iwm.org.uk
>>> @jamesinealing
****************************************************************
website: http://museumscomputergroup.org.uk/
Twitter: http://www.twitter.com/ukmcg
Facebook: http://www.facebook.com/museumscomputergroup
[un]subscribe: http://museumscomputergroup.org.uk/email-list/
****************************************************************