CETIS-METADATA Archives

CETIS-METADATA@JISCMAIL.AC.UK


Subject: Re: Identifiers again (a global, persistent but non-unique debate)
From: Andy Powell
Reply-To: Andy Powell
Date: Sun, 26 Sep 2004 09:01:24 +0100
Content-Type: TEXT/PLAIN

On Thu, 23 Sep 2004, Mike Collett wrote:

> Here is a contribution to the identifier debate most of which I have
> recently used elsewhere that may be of use in this SIG.

Mike,
thanks... this is helpful.  I disagree with some (perhaps most) of the
things you say! :-) ...but thinking thru my response has helped me to
understand where our differences lie (I think).

What follows is a long and fairly detailed response to the various points
you raise in your message.  For readers who have an interest, but who
can't be bothered with the detail, the executive summary goes something
like this...

I think that the key area where we fundamentally disagree has to do with
the balance between 'identification' and 'resolution'.  I think that
seamless resolution of identifiers by *all* currently deployed
Internet/Web software components (browsers, caches, proxies, Web servers,
email clients, local services, etc.) is critical to the success of any
chosen identifier scheme.  Mike seems to suggest that a clear separation
of 'identification' from 'resolution' (to the point that seamless
resolution by currently deployed technology isn't necessary?) is more
important. I tend to see Mike's viewpoint as being the theoretically
correct one - i.e. it should be possible to clearly separate these issues,
and being able to do so would lead to a better identification system
overall.  But my view is that, in practice, whatever kinds of identifiers
we deploy people and software will want/need to be able to resolve them
(by which I mean that people and software will expect to be able to
retrieve a 'representation' of the resource being identified) - and that
means that all the currently deployed technology needs to understand how
the resolution mechanism works.  If people can't resolve the identifiers
in a completely transparent way, then the chosen identifier scheme will
fail because it will not 'work' as people expect it to work.  The only
existing identifier that works in anything close to a transparent way (by
which I mean that almost *all* software components on the Internet
understand how the resolution mechanism works) is the 'http' URI.  So
either we use 'http' URIs, or we have to get very widespread uptake of any
new identifier scheme by the software infrastructure that makes up the
Internet.  At the moment, the proportion of deployed software (browsers,
etc.) that understand Handles or DOIs or URNs or whatever is very, very
small and this means (IMHO) that they haven't succeeded (yet).
Furthermore, I see very little evidence that these alternative URI schemes
will ever be widely deployed.  (Note: I'd like to be wrong about this
final point).

So to try and sum up my views in one sentence... any chosen identifier
scheme *must* be a valid URI (to provide syntactic integration with other
Internet standards), the URI scheme *must* be registered (to ensure the
uniqueness of identifiers within that scheme) and support for the scheme
*must* be widely deployed within the software infrastructure that makes up
the Internet (to ensure that resolution is transparent to the end-user).

There are only a few candidate URI schemes that meet these requirements
currently (http, ftp, ...), of which the 'http' URI seems the sensible
choice.  This choice would change if, and only if, alternative URI schemes
like 'hdl' become registered URI schemes and become supported by the bulk
of the software infrastructure that makes up the Internet (browsers,
caches, proxies, Web servers, email clients, local systems, etc.).
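
As a rough illustration of the three requirements above, here is a minimal
Python sketch. The two scheme sets are assumptions made up for this example
(they are not an authoritative copy of any registry or deployment survey),
and check_identifier is a hypothetical helper name.

```python
from urllib.parse import urlsplit

# Illustrative assumptions only, not an authoritative registry snapshot.
REGISTERED_SCHEMES = {"http", "ftp", "urn"}
WIDELY_DEPLOYED_SCHEMES = {"http", "ftp"}

def check_identifier(identifier):
    """Report which of the three requirements an identifier appears to meet."""
    scheme = urlsplit(identifier).scheme.lower()
    return {
        "has_uri_scheme": bool(scheme),
        "scheme_registered": scheme in REGISTERED_SCHEMES,
        "widely_deployed": scheme in WIDELY_DEPLOYED_SCHEMES,
    }

for candidate in ("http://hdl.handle.net/10.1790/712276811646",
                  "hdl:10.1790/712276811646",
                  "10.1790/712276811646"):
    print(candidate, check_identifier(candidate))
```

Under these assumptions only the 'http' form passes all three checks; the
'hdl:' form has a scheme but fails the other two, and the bare string fails
all of them.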

> With regard to identifiers I think we have to differentiate between at least
> 3 types of persistence (and uniqueness) that can get muddled up.
>
> 1. persistence of the resource (concept, event or ephemeral thing)
> 2. persistence of the identifier - resource relationship (this may be many
> to 1 but should never be x to many)
> 3. persistence of the resolvability of the identifier to something else (the
> resource, metadata or related information)
>
> 1. Everyone seems happy with the idea that the resource may die long before
> the identifier-resource relationship dies.

Agreed.

> 2. Everyone seems happy with the idea that it is important that the
> identifier is globally unique and only ever associated with a single thing.

Yup.

> This can be split into the two separate bits:
>   a.  namespace governance. There seem to be lots of valid candidates and
> some people have their favourite pet namespace. Expressing it in URI syntax
> may be helpful. Whether it is IANA registered or not may be important to
> some, but if you know it is a Handle for example you have some faith that it
> is unique.

Well OK... I'll go with that for now... but (to take Handles as a specific
example) I'd just like to flag up that I don't understand how I'm supposed
to "know" that any given identifier is a Handle unless some syntax tells
me that it is.  Perhaps more importantly, I'm not sure that "some faith"
is really good enough to ensure long term persistence.

And in the context of Internet-delivered services, which is what I assume
we are interested in on this list, conformance with the URI syntax is much
more than just 'helpful' - it is essential... because URIs are really the
*only* globally usable form of identifier on the Internet??

> A possible but unlikely problem is that hdl may be used by others
> in identifiers expressed as uri syntax. Between most communities, and any
> that follow the UKOLN advice,  hdl will be taken as Handle. If this becomes
> a real issue two possible solutions are that another IANA namespace is used
> if uri is essential and that hdl gets IANA registration - which seems just a
> matter of time?.

Yes possibly... though registration of the 'hdl' URI scheme has probably
seemed like 'just a matter of time' for a long time now!?? :-)

>   b. governance of the relationship - this is not so easy without some kind
> of authority organisation or agreement between organisations. The tendency
> is that the relationship is controlled by the publisher/creator of the
> identifier. The persistence of this relationship is as strong or as weak as
> the creator of the ID makes it. In the UK most people would have some faith
> in for example JISC, Becta, e-GU or their successors to maintain the
> relative persistence of these relationships - even if the organisations
> change their (domain) names or disappear as organisations.

I may well be missing your point here... but how can JISC maintain the
identifier-resource relationship if JISC no longer exists?  It can't can
it?  So something else has to take that maintenance on?

> So far the use of
> URLs has not been reliable as people often change the content at a given
> location. By building in the domain name (such as tsoid.org.uk) into the
> identifier arguably weakens the chance of persistence.

Yes agreed.  But this is a feature of the way in which people have chosen
to use 'http' URIs - it is not an inherent problem with them.  I have
argued in the past, and will continue to argue, that people and
organisations that are bad at maintaining the identifier/resource
relationship when using http URIs are likely to be just as bad at
maintaining the relationship with any other kind of identifier.  The
problem is a policy/cultural one, not a technological one.  Changing the
technology doesn't remove the problem - though I would agree that an
additional level of indirection always helps! :-) - it just moves it
elsewhere.

Note that PURLs are a mechanism for adding a level of indirection without
needing to move away from using 'http' URIs.
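
The indirection idea can be sketched in a few lines of Python. This is not
how any real PURL server is implemented; the redirect table, the paths and
the example.org locations are all hypothetical. It only shows that the
persistent 'http' URI stays the same while the location behind it changes.

```python
# Hypothetical redirect table: persistent path -> current location.
# Both the path and the target are made-up examples, not real PURLs.
REDIRECTS = {
    "/docs/report-42": "http://www.example.org/2004/reports/42.html",
}

def resolve(persistent_path):
    """Return an (HTTP status, Location) pair for a persistent path."""
    target = REDIRECTS.get(persistent_path)
    if target is None:
        return 404, None
    return 302, target

def move_resource(persistent_path, new_location):
    """Record that the resource moved; the persistent URI is unchanged."""
    REDIRECTS[persistent_path] = new_location
```

Whether the table is kept up to date is still a matter of policy, which is
the point made above: the indirection moves the maintenance problem, it does
not remove it.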

> 3. The persistence of the resolution is a very separate issue!
> It seems that it is often mixed up with other forms of persistence. In a
> similar way that people have regularly mixed up identification and location
> at the implementation stage. It also seems to be often assumed or
> expected that:
>     a. the resolution capability can or must be built into the (URI?)
> expression of the identifier
>     b. there will only be a single resolution of the identifier
>
> I think these assumptions are both false but others may disagree??

Well, I certainly agree that there is confusion in these areas. :-)

I also agree that b) is a false assumption, though I suspect that we mean
different things by saying it is false.  Resolution (to me) means
resolving the identifier into one or more 'representations' of the
resource being identified.  There may well be other services based on the
identifier (e.g. a DRM service) but they aren't 'resolution' services -
they're value-added services built around the identifier.  So, saying that
identifiers need multiple resolution means that they should be able to
resolve to multiple 'representations' of the resource, *not* that the
resolution service has to offer multiple kinds of value-added services.
Most Web servers already support multiple resolution of 'http' URIs thru
content negotiation and the like.
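
To make "multiple representations of one resource" concrete, here is a
simplified Python sketch of server-side content negotiation. It is an
assumption-laden illustration, not the algorithm of any particular Web
server: it handles exact media types and "*/*" only, ignores partial
wildcards such as "text/*", and the file names are hypothetical.

```python
def parse_accept(header):
    """Parse an Accept header into (media_type, q) pairs, highest q first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0]
        if not media_type:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs.append((media_type, q))
    # sorted() is stable, so equal q-values keep their header order
    return sorted(prefs, key=lambda p: p[1], reverse=True)

def choose_representation(accept_header, available):
    """Pick one representation of a single resource for this request."""
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue
        if media_type in available:
            return available[media_type]
        if media_type == "*/*" and available:
            return next(iter(available.values()))
    return None

# One 'http' URI, two representations (hypothetical file names).
available = {"text/html": "page.html", "application/rdf+xml": "page.rdf"}
print(choose_representation("application/rdf+xml;q=0.9, text/html;q=0.5", available))
```

The same identifier yields different representations depending on what the
client asks for, without any value-added service being involved.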

I think that a) is a mis-representation of the problem.  Resolution isn't
'built into' the identifier as such - a protocol of some kind is required
to perform the resolution.  (In the case of 'http' URIs, HTTP is the
protocol and an HTTP GET request is how resolution is performed). The
point is that the identifier has to function within a technical
environment (the Internet) which is very widely deployed.  So one of the
major arguments used against the registration of new URI schemes is that
they incur a very large implementation cost (world-wide) because of
the huge amount of existing software already out there that needs to be
modified to support the new scheme.  When weighed against the minimal
costs of re-using existing URI schemes, only new schemes which demonstrate
benefits that outweigh that cost are likely to be endorsed.  So, for 'hdl'
URIs to be deployed fully, support for that URI scheme would have to be
built into the already deployed base of software components (browsers,
caches, proxies, local services (to use your phrase below), etc.).  This
is not the case currently.

Now, I think that you argue below that software doesn't have to be
modified to deal with new URI schemes (like 'hdl') because, somehow, the
human end-users of these identifiers will know how to recognise them and
will know where to go to resolve them.  I.e. I will somehow know that when
I'm shown "10.1790/712276811646" I should change it into
"http://hdl.handle.net/10.1790/712276811646" in order to resolve it?  But
how am I supposed to know that?  What tiny proportion of Internet users
currently know that?  And in any case, why would I ever want to show my
users a string of numbers like that?
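
The manual rewriting step described above is trivial to write down in
Python, which is rather the point: the knowledge has to live somewhere.
The function name is hypothetical, and the sketch assumes the caller
already knows the string is a Handle.

```python
from urllib.parse import quote

HANDLE_PROXY = "http://hdl.handle.net/"  # proxy named in the text above

def handle_to_http_uri(handle):
    """Turn a bare Handle, or one prefixed with 'hdl:', into an 'http' URI."""
    handle = handle.strip()
    if handle.lower().startswith("hdl:"):
        handle = handle[4:]
    # keep '/' literal; percent-encode anything else that is unsafe
    return HANDLE_PROXY + quote(handle, safe="/")

print(handle_to_http_uri("10.1790/712276811646"))
```

Nothing in the bare string tells software to call this function; unless
every browser, cache and proxy ships with the equivalent logic, a person
has to do it by hand.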

The point is that identifiers should be as transparent as possible to
end-users and for that to happen *all* the software components that are
used in connection with those identifiers (browsers, caches, proxies,
local services, etc.) have to have knowledge built into them about how to
deal with the identifiers.  Until that happens, the end-user's experience
of the identifier will be far from transparent.

So, as a concrete example, DOIs are a non-transparent
identifier because they do not work seamlessly in the majority of
currently deployed software components.  They do work well in closed-world
implementations like CrossRef, but only because knowledge of the DOI has
been built into that particular application.

Currently, in the context of browsing the Web, a 'doi' URI is only
transparent if I'm using IE and I have the DOI plug-in installed.  If I
use any other browser, or if I don't have the plug-in installed, then it
isn't transparent because I have to take some manual action in order to
deal with the DOI.

> For example a user/local system may wish to check for resolution of id xxx
> via a number of preferred services e.g. in the order
> http://www.local.org.uk/mydept/xxx
> http://www.bath.ac.uk/xxx
> http://www.ukoln.ac.uk/xxx
> http://www.tsoid.org.uk
> As last resort if they all fail then if it is known (suspected) to be a
> Handle for example try
> http://hdl.handle.net/xxx   (if it is known to be a Handle)

But how does the user know that they have to do this?  And while I agree
that local systems could be configured to do it for people, the point is
that, in the global context of the Internet, *all* local systems have to
be modified to work this way because otherwise there isn't any global
predictability about whether/how the identifier is going to work.
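
For illustration, the ordered-fallback behaviour in the quoted example
might look like the Python sketch below on one local system. The resolver
bases come from the quoted list (tsoid.org.uk is left out because the
quoted line gives no path for it); the function names and the
caller-supplied fetch function are hypothetical, and no network access is
performed here.

```python
# Resolver bases in a hypothetical local order of preference.
RESOLVER_BASES = [
    "http://www.local.org.uk/mydept/",
    "http://www.bath.ac.uk/",
    "http://www.ukoln.ac.uk/",
    "http://hdl.handle.net/",  # last resort, if known to be a Handle
]

def candidate_uris(identifier):
    """Build the 'http' URIs to try, in order of preference."""
    return [base + identifier for base in RESOLVER_BASES]

def resolve_first(identifier, fetch):
    """Try each resolver in order.

    `fetch` is a caller-supplied function that returns a representation,
    or None if that resolver cannot resolve the URI.
    """
    for uri in candidate_uris(identifier):
        representation = fetch(uri)
        if representation is not None:
            return uri, representation
    return None, None
```

The list of bases is local configuration. A system configured with a
different list, or with none, will behave differently for the same
identifier, which is the lack of global predictability described above.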

> The doi 10.1790/712276811646 can already be resolved via several domains
> even though they all point to the same place. It can also be very
> effective to effect a Google search on xxx rather than the whole uri
> (try it with 10.1790/712276811646 for example).

But this is also an example of the problem!  Google already has built-in
knowledge about 'http' URIs and it therefore supports various kinds of
'rich' searches based on them.  Contrast this with the rather simplistic
text-string search that you have to do to find a DOI.  In short, Google is
a good example of a 'local system' that doesn't have any knowledge of DOIs
built in, and that therefore can't really deal with them in an effective
way.

> In addition the system may be set up to check one or more digital rights
> management services to see if there are any usage restrictions.
> http://www.digitalrightsmanager.com/xxx
>
> So when it is said that hdl:10.1790/712276811646 or even just
> 10.1790/712276811646 is not globally unique then that may become an
> issue, but if it is known that it is a Handle or some other well managed
> name space it is not a problem.

Once again, how do I know that '10.1790/712276811646' is a Handle?  The
form prefixed by 'hdl:' is better because if the syntax that I'm dealing
with (e.g. XHTML or an XML schema) tells me to expect a URI and I see
'hdl:10.1790/712276811646' then I know, in some sense, that I've got a
Handle.

This problem is further compounded because the 'hdl' URI scheme isn't
registered yet.  Therefore I actually have no guarantees about uniqueness
or persistence - because it is the scheme registration process that gives
me those two things.

So, to refer back to a couple of the comments you made right at the
beginning, not only is it critical that the identifier is a valid URI, but
it also *must* be a registered URI - because it is those two features that
provide us with the global Internet context that ensure uniqueness and
persistence.  (And it must be a URI scheme supported by *all* the deployed
software components that make up the Internet).

> But when it is said hdl:10.1790/712276811646 or even just
> 10.1790/712276811646 is not resolvable **on its own** I would say that is
> intended and very desirable.

I think this statement runs to the heart of our differences...  I would
argue that this feature is neither intended nor desirable!  My guess is
that Handles were designed to be resolvable, but that the Handle software
hasn't been widely enough deployed in browsers, etc. in order that the
majority of Internet users can take advantage of their resolution
capability.

> It is very likely that anyone who exposes the Handle id 10.1790/712276811646
> will also prefix it with one or more domains that can resolve it. So the
> resolution and identification may be contained in a single uri but the id
> and resolution are, and can be, separated.

Just to clarify, I presume you mean prefixed by something to form an
'http' URI?

In which case, I would argue that it is the 'http' URI that functions as
the identifier, not the unprefixed string?  But I agree that one could
argue it both ways.

> If the id is a for example a url are there any syntax problems with for
> example trying to resolve
> http://www.egu.gov.uk/http://www.tsoid.org.uk/xxx ???

I'm not 100% sure what you are asking here, nor why?  The URI above is
invalid I think, but proper character encoding could be used to make it
valid.  But... so what!?
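
A short Python sketch of what "proper character encoding" could mean here:
percent-encode the inner URI so that it travels as a single path segment
of the outer one. The function name is hypothetical, and whether a given
resolver would accept this form is an assumption.

```python
from urllib.parse import quote, unquote

def nest_uri(resolver_base, inner_uri):
    """Embed one URI inside another as a single encoded path segment."""
    # safe="" also encodes ':' and '/', so the inner URI cannot be
    # mistaken for extra path segments of the outer URI
    return resolver_base + quote(inner_uri, safe="")

nested = nest_uri("http://www.egu.gov.uk/", "http://www.tsoid.org.uk/xxx")
print(nested)
# the receiving service can recover the inner URI by decoding the segment
print(unquote(nested[len("http://www.egu.gov.uk/"):]))
```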

> Summary
> The main (or even sole) purpose of a digital identifier is to maintain the
> globally unique persistence of the identifier - resource relationship.
>
> The persistence of the resolution is separate and secondary, but still
> important.

I wonder if it is always the case that resolution is secondary?

> This resolution may be done independently by multiple communities
> or organisations, possibly selected as trusted services by the user.

I certainly don't disagree with this.  But it misses the more important
point - that the identifier needs to work seamlessly in the context of the
currently deployed Web or that we need to have a way of getting very
widespread adoption of a new identifier by all parts of the Web
infrastructure in order that it will work seamlessly.  Clearly, the latter
approach will not be easy, especially given that it is nearly impossible
to register new URI schemes.  In any case, the hard work really only
starts once registration has happened - the hard work is in getting the
majority of technology developers to adopt it into their products.

Andy
--
Distributed Systems, UKOLN, University of Bath, Bath, BA2 7AY, UK
http://www.ukoln.ac.uk/ukoln/staff/a.powell/      +44 1225 383933
Resource Discovery Network http://www.rdn.ac.uk/
ECDL 2004, Bath, UK - 12-17 Sept 2004 - http://www.ecdl2004.org/
