JISCMail - CETIS-METADATA Archives

CETIS-METADATA@JISCMAIL.AC.UK

Subject:
Re: Identifiers again (a global, persistent but non-unique debate)
From:
Andy Powell <[log in to unmask]>
Reply-To:
Andy Powell <[log in to unmask]>
Date:
Sun, 26 Sep 2004 09:01:24 +0100
On Thu, 23 Sep 2004, Mike Collett wrote:

> Here is a contribution to the identifier debate most of which I have
> recently used elsewhere that may be of use in this SIG.

Mike,
thanks... this is helpful.  I disagree with some (perhaps most) of the
things you say! :-) ...but thinking thru my response has helped me to
understand where our differences lie (I think).

What follows is a long and fairly detailed response to the various points
you raise in your message.  For readers who have an interest, but who
can't be bothered with the detail, the executive summary goes something
like this...

I think that the key area where we fundamentally disagree has to do with
the balance between 'identification' and 'resolution'.  I think that
seamless resolution of identifiers by *all* currently deployed
Internet/Web software components (browsers, caches, proxies, Web servers,
email clients, local services, etc.) is critical to the success of any
chosen identifier scheme.  Mike seems to suggest that a clear separation
of 'identification' from 'resolution' (to the point that seamless
resolution by currently deployed technology isn't necessary?) is more
important. I tend to see Mike's viewpoint as being the theoretically
correct one - i.e. it should be possible to clearly separate these issues,
and being able to do so would lead to a better identification system
overall.  But my view is that, in practice, whatever kinds of identifiers
we deploy, people and software will want/need to be able to resolve them
(by which I mean that people and software will expect to be able to
retrieve a 'representation' of the resource being identified) - and that
means that all the currently deployed technology needs to understand how
the resolution mechanism works.  If people can't resolve the identifiers
in a completely transparent way, then the chosen identifier scheme will
fail because it will not 'work' as people expect it to work.  The only
existing identifier that works in anything close to a transparent way (by
which I mean that almost *all* software components on the Internet
understand how the resolution mechanism works) is the 'http' URI.  So
either we use 'http' URIs, or we have to get very widespread uptake of any
new identifier scheme by the software infrastructure that makes up the
Internet.  At the moment, the proportion of deployed software (browsers,
etc.) that understands Handles or DOIs or URNs or whatever is very, very
small and this means (IMHO) that they haven't succeeded (yet).
Furthermore, I see very little evidence that these alternative URI schemes
will ever be widely deployed.  (Note: I'd like to be wrong about this
final point).

So to try and sum up my views in one sentence... any chosen identifier
scheme *must* be a valid URI (to provide syntactic integration with other
Internet standards), the URI scheme *must* be registered (to ensure the
uniqueness of identifiers within that scheme) and support for the scheme
*must* be widely deployed within the software infrastructure that makes up
the Internet (to ensure that resolution is transparent to the end-user).

There are only a few candidate URI schemes that meet these requirements
currently (http, ftp, ...), of which the 'http' URI seems the sensible
choice.  This choice would change if, and only if, alternative URI schemes
like 'hdl' become registered URI schemes and become supported by the bulk
of the software infrastructure that makes up the Internet (browsers,
caches, proxies, Web servers, email clients, local systems, etc.).

> With regard to identifiers I think we have to differentiate between at least
> 3 types of persistence (and uniqueness) that can get muddled up.
>
> 1. persistence of the resource (concept, event or ephemeral thing)
> 2. persistence of the identifier - resource relationship (this may be many
> to 1 but should never be x to many)
> 3. persistence of the resolvability of the identifier to something else (the
> resource, metadata or related information)
>
> 1. Everyone seems happy with the idea that the resource may die long before
> the identifier-resource relationship dies.

Agreed.

> 2. Everyone seems happy with the idea that it is important that the
> identifier is globally unique and only ever associated with a single thing.

Yup.

> This can be split into the two separate bits:
>   a.  namespace governance. There seem to be lots of valid candidates and
> some people have their favourite pet namespace. Expressing it in URI syntax
> may be helpful. Whether it is IANA registered or not may be important to
> some, but if you know it is a Handle for example you have some faith that it
> is unique.

Well OK... I'll go with that for now... but (to take Handles as a specific
example) I'd just like to flag up that I don't understand how I'm supposed
to "know" that any given identifier is a Handle unless some syntax tells
me that it is.  Perhaps more importantly, I'm not sure that "some faith"
is really good enough to ensure long term persistence.

And in the context of Internet-delivered services, which is what I assume
we are interested in on this list, conformance with the URI syntax is much
more than just 'helpful' - it is essential... because URIs are really the
*only* globally usable form of identifier on the Internet??

> A possible but unlikely problem is that hdl may be used by others
> in identifiers expressed as uri syntax. Between most communities, and any
> that follow the UKOLN advice,  hdl will be taken as Handle. If this becomes
> a real issue two possible solutions are that another IANA namespace is used
> if uri is essential and that hdl gets IANA registration - which seems just a
> matter of time?.

Yes possibly... though registration of the 'hdl' URI scheme has probably
seemed like 'just a matter of time' for a long time now!?? :-)

>   b. governance of the relationship - this is not so easy without some kind
> of authority organisation or agreement between organisations. The tendency
> is that the relationship is controlled by the publisher/creator of the
> identifier. The persistence of this relationship is as strong or as weak as
> the creator of the ID makes it. In the UK most people would have some faith
> in for example JISC, Becta, e-GU or their successors to maintain the
> relative persistence of these relationships - even if the organisations
> change their (domain) names or disappear as organisations.

I may well be missing your point here... but how can JISC maintain the
identifier-resource relationship if JISC no longer exists?  It can't, can
it?  So something else has to take that maintenance on?

> So far the use of
> URLs has not been reliable as people often change the content at a given
> location. Building the domain name (such as tsoid.org.uk) into the
> identifier arguably weakens the chance of persistence.

Yes agreed.  But this is a feature of the way in which people have chosen
to use 'http' URIs - it is not an inherent problem with them.  I have
argued in the past, and will continue to argue, that people and
organisations that are bad at maintaining the identifier/resource
relationship when using http URIs are likely to be just as bad at
maintaining the relationship with any other kind of identifier.  The
problem is a policy/cultural one, not a technological one.  Changing the
technology doesn't remove the problem - though I would agree that an
additional level of indirection always helps! :-) - it just moves it
elsewhere.

Note that PURLs are a mechanism for adding a level of indirection without
needing to move away from using 'http' URIs.

> 3. The persistence of the resolution is a very separate issue!
> It seems that it is often mixed up with other forms of persistence. In a
> similar way that people have regularly mixed up identification and location
> at the implementation stage. It also seems to be often assumed or
> expected that:
>     a. the resolution capability can or must be built into the (URI?)
> expression of the identifier
>     b. there will only be a single resolution of the identifier
>
> I think these assumptions are both false but others may disagree??

Well, I certainly agree that there is confusion in these areas. :-)

I also agree that b) is a false assumption, though I suspect that we mean
different things by saying it is false.  Resolution (to me) means
resolving the identifier into one or more 'representations' of the
resource being identified.  There may well be other services based on the
identifier (e.g. a DRM service) but they aren't 'resolution' services -
they're value-added services built around the identifier.  So, saying that
identifiers need multiple resolution means that they should be able to
resolve to multiple 'representations' of the resource, *not* that the
resolution service has to offer multiple kinds of value-added services.
Most Web servers already support multiple resolution of 'http' URIs thru
content negotiation and the like.
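
As a rough illustration of what content negotiation on an 'http' URI looks
like from the client side, here is a minimal Python sketch.  It only
prepares two GET requests for the same identifier with different Accept
headers and does not send them.  The function name and the two media types
are assumptions chosen for illustration; whether any given server actually
offers those representations is not something this message establishes.

```python
from urllib.request import Request


def build_negotiated_request(uri: str, media_type: str) -> Request:
    """Prepare (but do not send) a GET request asking for one representation."""
    # With no request body, urllib treats the request as a GET.
    return Request(uri, headers={"Accept": media_type})


# Same identifier, two different representations requested (media types
# are illustrative assumptions).
html_request = build_negotiated_request(
    "http://hdl.handle.net/10.1790/712276811646", "text/html")
rdf_request = build_negotiated_request(
    "http://hdl.handle.net/10.1790/712276811646", "application/rdf+xml")

print(html_request.get_method(), html_request.get_header("Accept"))
print(rdf_request.get_method(), rdf_request.get_header("Accept"))
```

The identifier stays the same in both requests; only the requested
representation differs.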

I think that a) is a mis-representation of the problem.  Resolution isn't
'built into' the identifier as such - a protocol of some kind is required
to perform the resolution.  (In the case of 'http' URIs, HTTP is the
protocol and an HTTP GET request is how resolution is performed). The
point is that the identifier has to function within a technical
environment (the Internet) which is very widely deployed.  So one of the
major arguments used against the registration of new URI schemes is that
they incur a very large implementation cost (world-wide) because of
the huge amount of existing software already out there that needs to be
modified to support the new scheme.  When weighed against the minimal
costs of re-using existing URI schemes, only new schemes which demonstrate
benefits that outweigh that cost are likely to be endorsed.  So, for 'hdl'
URIs to be deployed fully, support for that URI scheme would have to be
built into the already deployed base of software components (browsers,
caches, proxies, local services (to use your phrase below), etc.).  This
is not the case currently.

Now, I think that you argue below that software doesn't have to be
modified to deal with new URI schemes (like 'hdl') because, somehow, the
human end-users of these identifiers will know how to recognise them and
will know where to go to resolve them.  I.e. I will somehow know that when
I'm shown "10.1790/712276811646" I should change it into
"http://hdl.handle.net/10.1790/712276811646" in order to resolve it?  But
how am I supposed to know that?  What tiny proportion of Internet users
currently know that?  And in any case, why would I ever want to show my
users a string of numbers like that?
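
To make the point concrete, the rewriting step that a person (or a piece
of software) would need to know about can be sketched in a few lines of
Python.  The proxy address is the one quoted above; the function name and
the handling of an optional 'hdl:' prefix are assumptions made for
illustration, not part of any Handle specification quoted in this thread.

```python
# The proxy address comes from the message; everything else is hypothetical.
HANDLE_PROXY = "http://hdl.handle.net/"


def handle_to_http_uri(identifier: str) -> str:
    """Turn a bare or 'hdl:'-prefixed Handle into an 'http' URI via the proxy."""
    handle = identifier.strip()
    # Drop the (unregistered) 'hdl:' prefix if it is present.
    if handle.lower().startswith("hdl:"):
        handle = handle[len("hdl:"):]
    return HANDLE_PROXY + handle


print(handle_to_http_uri("10.1790/712276811646"))
```

The code is trivial; the difficulty is that something has to know to run
it, which is exactly the knowledge most deployed software lacks.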

The point is that identifiers should be as transparent as possible to
end-users and for that to happen *all* the software components that are
used in connection with those identifiers (browsers, caches, proxies,
local services, etc.) have to have knowledge built into them about how to
deal with the identifiers.  Until that happens, the end-user's experience
of the identifier will be far from transparent.

So, to take a concrete case, DOIs are a good example of a non-transparent
identifier because they do not work seamlessly in the majority of
currently deployed software components.  They do work well in closed-world
implementations like CrossRef, but only because knowledge of the DOI has
been built into that particular application.

Currently, in the context of browsing the Web, a 'doi' URI is only
transparent if I'm using IE and I have the DOI plug-in installed.  If I
use any other browser, or if I don't have the plug-in installed, then it
isn't transparent because I have to take some manual action in order to
deal with the DOI.

> For example a user/local system may wish to check for resolution of id xxx
> via a number of preferred services e.g. in the order
> http://www.local.org.uk/mydept/xxx
> http://www.bath.ac.uk/xxx
> http://www.ukoln.ac.uk/xxx
> http://www.tsoid.org.uk
> As last resort if they all fail then if it is known (suspected) to be a
> Handle for example try
> http://hdl.handle.net/xxx   (if it is known to be a Handle)

But how does the user know that they have to do this?  And while I agree
that local systems could be configured to do it for people, the point is
that, in the global context of the Internet, *all* local systems have to
be modified to work this way because otherwise there isn't any global
predictability about whether/how the identifier is going to work.
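
For what it's worth, the kind of local configuration being described
might look something like the Python sketch below.  It only builds the
ordered list of candidate 'http' URIs; it makes no network requests.  The
resolver list is abridged from the quoted example, and the function name
and the 'known_handle' flag are assumptions made for illustration.

```python
# Preferred resolver prefixes, abridged from the quoted example.
PREFERRED_RESOLVERS = [
    "http://www.local.org.uk/mydept/",
    "http://www.bath.ac.uk/",
    "http://www.ukoln.ac.uk/",
]
HANDLE_PROXY = "http://hdl.handle.net/"


def candidate_uris(identifier: str, known_handle: bool = False) -> list:
    """Return the 'http' URIs to try, in order, for one identifier."""
    candidates = [prefix + identifier for prefix in PREFERRED_RESOLVERS]
    # Last resort: only applies if something has told us this is a Handle.
    if known_handle:
        candidates.append(HANDLE_PROXY + identifier)
    return candidates


for uri in candidate_uris("10.1790/712276811646", known_handle=True):
    print(uri)
```

Note that both the resolver list and the 'known_handle' flag have to be
supplied from outside the identifier, which is the configuration that every
local system would need to share.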

> The doi 10.1790/712276811646 can already be resolved via several domains
> even though they all point to the same place. It can also be very
> effective to effect a Google search on xxx rather than the whole uri
> (try it with 10.1790/712276811646 for example).

But this is also an example of the problem!  Google already has built-in
knowledge about 'http' URIs and it therefore supports various kinds of
'rich' searches based on them.  Contrast this with the rather simplistic
text-string search that you have to do to find a DOI.  In short, Google is
a good example of a 'local system' that doesn't have any knowledge of DOIs
built in, and that therefore can't really deal with them in an effective
way.

> In addition the system may be set up to check one or more digital rights
> management services to see if there are any usage restrictions.
> http://www.digitalrightsmanager.com/xxx
>
> So when it is said that hdl:10.1790/712276811646 or even just
> 10.1790/712276811646 is not globally unique then that may become an
> issue, but if it is known that it is a Handle or some other well managed
> name space it is not a problem.

Once again, how do I know that '10.1790/712276811646' is a Handle?  The
form prefixed by 'hdl:' is better because if the syntax that I'm dealing
with (e.g. XHTML or an XML schema) tells me to expect a URI and I see
'hdl:10.1790/712276811646' then I know, in some sense, that I've got a
Handle.
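
A small Python sketch of the difference, using the standard library's URI
parser.  The helper name is hypothetical; the only thing being shown is
that the prefixed form carries a scheme that software can read, while the
bare string carries nothing.

```python
from urllib.parse import urlsplit


def scheme_of(identifier: str):
    """Return the URI scheme of an identifier, or None if it has none."""
    scheme = urlsplit(identifier).scheme
    return scheme or None


for value in ("hdl:10.1790/712276811646", "10.1790/712276811646"):
    print(repr(value), "->", scheme_of(value))
```

Of course, reading 'hdl' off the front of the string only tells software
what it has been given; it says nothing about whether that software knows
how to resolve it.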

This problem is further compounded because the 'hdl' URI scheme isn't
registered yet.  Therefore I actually have no guarantees about uniqueness
or persistence - because it is the scheme registration process that gives
me those two things.

So, to refer back to a couple of the comments you made right at the
beginning, not only is it critical that the identifier is a valid URI, but
it also *must* be a registered URI - because it is those two features that
provide us with the global Internet context that ensures uniqueness and
persistence.  (And it must be a URI scheme supported by *all* the deployed
software components that make up the Internet).

> But when it is said hdl:10.1790/712276811646 or even just
> 10.1790/712276811646 is not resolvable **on its own** I would say that is
> intended and very desirable.

I think this statement runs to the heart of our differences...  I would
argue that this feature is neither intended nor desirable!  My guess is
that Handles were designed to be resolvable, but that the Handle software
hasn't been widely enough deployed in browsers, etc. in order that the
majority of Internet users can take advantage of their resolution
capability.

> It is very likely that anyone who exposes the Handle id 10.1790/712276811646
> will also prefix it with one or more domains that can resolve it. So the
> resolution and identification may be contained in a single uri but the id
> and resolution are, and can be, separated.

Just to clarify, I presume you mean prefixed by something to form an
'http' URI?

In which case, I would argue that it is the 'http' URI that functions as
the identifier, not the unprefixed string?  But I agree that one could
argue it both ways.

> If the id is a for example a url are there any syntax problems with for
> example trying to resolve
> http://www.egu.gov.uk/http://www.tsoid.org.uk/xxx ???

I'm not 100% sure what you are asking here, nor why?  The URI above is
invalid I think, but proper character encoding could be used to make it
valid.  But... so what!?
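
For the record, one way of doing that character encoding is sketched
below in Python: the embedded URI is percent-encoded so that it becomes a
single path segment of the outer 'http' URI.  The function name is
hypothetical, and whether the resolver at www.egu.gov.uk would accept and
decode such a path is an assumption, not something established here.

```python
from urllib.parse import quote


def embed_identifier(resolver_base: str, identifier: str) -> str:
    """Append a percent-encoded identifier to a resolver's base URI."""
    # safe="" forces ':' and '/' in the embedded URI to be encoded too.
    return resolver_base + quote(identifier, safe="")


print(embed_identifier("http://www.egu.gov.uk/", "http://www.tsoid.org.uk/xxx"))
```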

> Summary
> The main (or even sole) purpose of a digital identifier is to maintain the
> globally unique persistence of the identifier - resource relationship.
>
> The persistence of the resolution is separate and secondary, but still
> important.

I wonder if it is always the case that resolution is secondary?

> This resolution may be done independently by multiple communities
> or organisations, possibly selected as trusted services by the user.

I certainly don't disagree with this.  But it misses the more important
point - that the identifier needs to work seamlessly in the context of the
currently deployed Web or that we need to have a way of getting very
widespread adoption of a new identifier by all parts of the Web
infrastructure in order that it will work seamlessly.  Clearly, the latter
approach will not be easy, especially given that it is nearly impossible
to register new URI schemes.  In any case, the hard work really only
starts once registration has happened - the hard work is in getting the
majority of technology developers to adopt it into their products.

Andy
--
Distributed Systems, UKOLN, University of Bath, Bath, BA2 7AY, UK
http://www.ukoln.ac.uk/ukoln/staff/a.powell/      +44 1225 383933
Resource Discovery Network http://www.rdn.ac.uk/
ECDL 2004, Bath, UK - 12-17 Sept 2004 - http://www.ecdl2004.org/