Hi,
Thanks to those who answered on the various statistics discussion lists.
In the first reply below I specify in a bit more detail what I am trying to do.
In addition I have found some references that might be of interest:
B.S. Everitt and S. Rabe-Hesketh: "The Analysis of Proximity Data", Kendall's
Library of Statistics 4.
Pages 18-20 cover "Inter-group distance and dissimilarity measures".
Chapter 11 of W.J. Krzanowski: "Principles of Multivariate Analysis" covers
"Highlighting differences between groups."
Anders
****************************************************************************
Underneath is a compilation of the answers I got:
****************************************************************************
You wrote,
>* We are classifying bottles of unknown plastic material into 7 different
>plastic groups, using NIR-spectroscopy on a discrete number of fixed
>wavelengths.
>* Each group is represented by a mean and a covariance matrix, based on a
>sample set of approx. 20 bottles each.
>* Each covariance matrix is different from the other.
>* We would like to optimize the discrete wavelengths used, so that the 7
>plastic clusters are separated as effectively as possible.
>
>Then the question:
>
>How can we objectively measure the separation between clusters?
>
>Is there a standard measure of a normalized distance between two
>hyperellipsoids, which takes into account the covariance matrices of both
>clusters. (Equivalent to the Mahalanobis distance between a point and a
>hyperellipsoid)
The standard way to measure distance between clusters is to measure the
distance between their means. If the covariance matrices were equal, you
could use the Mahalanobis distance. When the covariance matrices are very
different, there is no general theory, but an obvious approach would be to
measure two distances for each pair of cluster centers, one using one cluster's
covariance matrix and the other using the other's, and to use the smaller
distance.
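(Not part of the original reply: a minimal NumPy sketch of that suggestion, with hypothetical names, assuming the cluster means and covariance matrices have already been estimated.)

import numpy as np

def min_mahalanobis(mu_i, cov_i, mu_j, cov_j):
    # Mahalanobis distance between two cluster means, computed once with
    # each cluster's own covariance matrix; the smaller value is returned.
    d = mu_i - mu_j
    dist_i = np.sqrt(d @ np.linalg.solve(cov_i, d))  # using cluster i's covariance
    dist_j = np.sqrt(d @ np.linalg.solve(cov_j, d))  # using cluster j's covariance
    return min(dist_i, dist_j)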
All of this assumes that your data are reasonably fit by a normal
distribution. If the data has a multimodal distribution, you might want to
use a clustering technique to identify normal subclasses within each class.
I hope that these suggestions are helpful.
-- Dick Duda
****************************************************************************
Here are some misc. notes that I have that may help. These
"separability indices" should be documented in pattern
recognition texts and perhaps clustering books as well.
I am writing from the perspective of classification in
remote sensing.
Let U = mean vector, S = covariance matrix;
i and j index the two classes (groups) being compared; tr = trace of a matrix;
inv(x) is the inverse of the (covariance) matrix;
and a matrix transpose is denoted by an apostrophe ("'").
Normal distributions are assumed. Oh, and det() will
indicate the determinant of a matrix.
These separabilities are between two classes; they are not
averaged across all class pairs or computed with pooled covariance
matrices, etc.
o Mahalanobis Distance Separability index**
Two formulations... I would be grateful if anyone could clarify
which is correct, or whether both are legitimate derivations.
[Ui - Uj]' [inv(Si) + inv(Sj)] [Ui - Uj]
-or-
[Ui - Uj]' [inv((Si + Sj) / 2)] [Ui - Uj]
**anyone have any references for these derivations?
o Divergence
1/2 tr[Si - Sj][inv(Sj) - inv(Si)] +
1/2 [Ui - Uj]'[inv(Si) + inv(Sj)][Ui - Uj]
-or- (equivalent result)
1/2 tr[Si - Sj][inv(Sj) - inv(Si)] +
1/2 tr[inv(Si) + inv(Sj)][Ui - Uj][Ui - Uj]'
o Bhattacharyya distance
1/8 * [Ui - Uj]'[inv((Si + Sj)/2)][Ui - Uj] +
1/2 ln[ det((Si + Sj)/2) / sqrt(det(Si)*det(Sj)) ]
o Jeffreys-Matusita ( sometimes incorrectly Jeffries- )
Let B = Bhattacharyya distance
sqrt(2 * (1 - exp(-B)))
o Transformed divergence
Let D = divergence
a * (1 - exp(-D/b))
(If I recall correctly, a=2 and b=8 in Swain and Davis, 1978;
other values have been used; see Thomas et al. 1989.)
Obviously, there are strengths and weaknesses to each of these.
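(A sketch of the indices above in NumPy, assuming two normal classes with mean vectors Ui, Uj and covariance matrices Si, Sj; the function names are mine, and the a=2, b=8 defaults follow the note on Swain and Davis above. Both Mahalanobis formulations are included so they can be compared numerically.)

import numpy as np

def mahalanobis_sq(Ui, Si, Uj, Sj, pooled=True):
    # pooled=True:  [Ui - Uj]' inv((Si + Sj)/2) [Ui - Uj]      (second formulation)
    # pooled=False: [Ui - Uj]' (inv(Si) + inv(Sj)) [Ui - Uj]   (first formulation)
    d = Ui - Uj
    if pooled:
        M = np.linalg.inv((Si + Sj) / 2.0)
    else:
        M = np.linalg.inv(Si) + np.linalg.inv(Sj)
    return float(d @ M @ d)

def divergence(Ui, Si, Uj, Sj):
    # 1/2 tr[(Si - Sj)(inv(Sj) - inv(Si))] + 1/2 [Ui - Uj]'(inv(Si) + inv(Sj))[Ui - Uj]
    Si_inv, Sj_inv = np.linalg.inv(Si), np.linalg.inv(Sj)
    d = Ui - Uj
    return 0.5 * np.trace((Si - Sj) @ (Sj_inv - Si_inv)) + 0.5 * float(d @ (Si_inv + Sj_inv) @ d)

def bhattacharyya(Ui, Si, Uj, Sj):
    # 1/8 [Ui - Uj]' inv((Si + Sj)/2) [Ui - Uj] + 1/2 ln(det((Si + Sj)/2) / sqrt(det(Si) det(Sj)))
    S = (Si + Sj) / 2.0
    d = Ui - Uj
    return 0.125 * float(d @ np.linalg.solve(S, d)) + 0.5 * np.log(
        np.linalg.det(S) / np.sqrt(np.linalg.det(Si) * np.linalg.det(Sj)))

def jeffreys_matusita(Ui, Si, Uj, Sj):
    # sqrt(2 (1 - exp(-B))); saturates at sqrt(2) for well-separated classes
    return np.sqrt(2.0 * (1.0 - np.exp(-bhattacharyya(Ui, Si, Uj, Sj))))

def transformed_divergence(Ui, Si, Uj, Sj, a=2.0, b=8.0):
    # a (1 - exp(-D/b)); saturates at a for well-separated classes
    return a * (1.0 - np.exp(-divergence(Ui, Si, Uj, Sj) / b))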
References:
Kailath, Thomas (1967). "The divergence and Bhattacharyya distance measures
in signal selection." IEEE Transactions on Communications, vol. COM-15, 52-60.
Toussaint, Godfried T. (1972). "Comments on 'The divergence and Bhattacharyya
distance measures in signal selection'." IEEE Transactions on Communications,
vol. COM-20, no. 3, pt. 1, 485.
Swain, Philip H. and Davis, Shirley M., eds. (1978). Remote Sensing: The
Quantitative Approach. New York: McGraw-Hill. ISBN 007062576X.
(Covers the Jeffreys-Matusita distance.)
Swain, Philip H. (1982). "Pattern recognition techniques for remote sensing
applications." Chapter 28 in Classification, Pattern Recognition, and
Reduction of Dimensionality (Krishnaiah, Paruchuri R. and Kanal, Laveen N.,
eds.), Handbook of Statistics, vol. 2. New York: North-Holland. ISBN 044486217X.
Thomas, I. et al. (c1989). Classification of Remotely Sensed Images. ...
Also: books by K-S Fu and (I think) Fukunaga.
HTH.
-- David
****************************************************************************
SAS's PROC DISCRIM calculates a pairwise generalized squared
distance-between-groups matrix, which is described in the SAS/STAT
User's Guide, Version 6, 4th edition, Volume 1, page 680.
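(For readers without SAS: a rough, hypothetical NumPy sketch of a distance of that general form, i.e. the squared Mahalanobis distance from one group's mean to another group using the second group's own covariance matrix, plus log-determinant and prior terms. See the manual page cited above for the exact definition PROC DISCRIM uses.)

import numpy as np

def generalized_sq_distance(mu_i, mu_j, cov_j, prior_j=None):
    # Squared Mahalanobis distance from group i's mean to group j, using
    # group j's covariance, plus ln|cov_j|; with unequal priors a
    # -2 ln(prior_j) term is often added as well.
    d = mu_i - mu_j
    dist = float(d @ np.linalg.solve(cov_j, d)) + np.log(np.linalg.det(cov_j))
    if prior_j is not None:
        dist -= 2.0 * np.log(prior_j)
    return dist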
Brian Schott / Decision Sciences Dept. / (404) 651-4070 / [log in to unmask]
J. Mack Robinson College of Business / Georgia State Univ.
Atlanta, Georgia USA 30303-3083
http://www.Gsu.EDU/~dscbms/
Interests: approximate reasoning, decision support systems
****************************************************************************
The Mahalanobis distance between two centroids uses the pooled
within-cluster covariance matrix, under the assumption that the covariance
matrices are not significantly different from one another. Most
multivariate texts describe this.
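(A short sketch of that pooled-covariance form, assuming the two groups are given as rows of NumPy arrays; the degrees-of-freedom weighting shown is the usual one, but check a multivariate text for the convention you prefer.)

import numpy as np

def pooled_mahalanobis_sq(X1, X2):
    # Squared Mahalanobis distance between two group centroids using the
    # pooled within-group covariance matrix (observations as rows).
    n1, n2 = len(X1), len(X2)
    S1 = np.cov(X1, rowvar=False)
    S2 = np.cov(X2, rowvar=False)
    S_pooled = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)
    d = X1.mean(axis=0) - X2.mean(axis=0)
    return float(d @ np.linalg.solve(S_pooled, d))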
Rich Strauss
===========================================================
Dr Richard E Strauss Phone: 806-742-2719
Associate Professor Fax: 806-742-2963
Biological Sciences
Texas Tech University
Lubbock TX 79409-3131 Email: [log in to unmask]
===========================================================
****************************************************************************
Dear Anders
According to my copy of the Cambridge Dictionary of Statistics, the
Mahalanobis D-squared statistic is a measure of the distance between
two groups. You may find the formula given there useful. The entry
also refers to chapter 4 of Krzanowski's Multivariate Analysis, Part I
(Edward Arnold, 1994).
Regards
Miland Joshi (Mr.)
Department of Epidemiology and Public Health
University of Leicester
****************************************************************************