I looked at the R function hclust (hierarchical clustering), but it appears
to require a full triangular n x n dissimilarity matrix as input, where n is
the number of observations.
My data set has n = 50,000 observations (keywords) and 200 variables, and I
use ad hoc similarity measures, not available in R, to measure keyword
similarity. The vast majority of the n x n similarities are equal to zero.
So I am looking for a clustering procedure that would accept the following
alternate input:
x1, y1, s1
x2, y2, s2
...
xk, yk, sk
where xi, yi are two keywords with similarity si > 0 (1 <= i <= k). This
input would contain k = 10,000 rows, far fewer than the n x n = 50,000 x
50,000 elements of the full similarity matrix. I expect hclust would crash
(run out of memory) if I gave it the full dissimilarity matrix as input.
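
To make the size gap concrete, here is a rough Python sketch (not R). The
function names are made up for illustration, and it assumes comma-separated
lines like the ones above, keywords that contain no commas, and 8 bytes per
stored value.

```python
def dense_triangle_entries(n):
    """Number of entries in the strict lower triangle of an n x n matrix."""
    return n * (n - 1) // 2


def parse_triples(lines):
    """Parse 'x, y, s' lines into {(x, y): s}, keeping only pairs with s > 0.

    Assumption: fields are comma-separated and keywords contain no commas.
    """
    sims = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        x, y, s = (field.strip() for field in line.split(","))
        s = float(s)
        if s > 0:
            # Store each unordered pair once, with the two keywords sorted.
            sims[tuple(sorted((x, y)))] = s
    return sims


n = 50_000
entries = dense_triangle_entries(n)
print(entries)            # 1249975000
print(entries * 8 / 1e9)  # approximate size in GB, assuming 8-byte values
```

The sparse input, by contrast, needs only one dictionary entry per row.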
Do you know how to use my small data input in R, instead of a very large
sparse similarity matrix? Or in SAS? I need a simple solution; otherwise
I'll just write the hierarchical clustering code myself, in C or Perl, or
use a library. It would take me about 2 hours to write that code from
scratch, so I'm looking for a simple solution that takes less than 2 hours
to implement.
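
For reference, this is roughly the kind of thing I have in mind, sketched in
Python rather than C or Perl. It covers single linkage only (average,
complete or Ward linkage would need different logic), it treats every pair
missing from the input as similarity 0, and the function name
cut_single_linkage is just a placeholder.

```python
from collections import defaultdict


def cut_single_linkage(triples, threshold):
    """Clusters obtained by cutting a single-linkage tree at `threshold`.

    triples: iterable of (x, y, s) with s > 0; unlisted pairs count as 0.
    Two keywords share a cluster if a chain of pairs with s >= threshold
    connects them.
    """
    triples = list(triples)
    parent = {}

    def find(u):
        # Union-find root lookup with path halving.
        parent.setdefault(u, u)
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    # Register every keyword so that unmerged ones show up as singletons.
    for x, y, _ in triples:
        find(x)
        find(y)

    # Processing pairs from most to least similar follows the single-linkage
    # merge order; stop once similarities fall below the cut.
    for x, y, s in sorted(triples, key=lambda t: t[2], reverse=True):
        if s < threshold:
            break
        rx, ry = find(x), find(y)
        if rx != ry:
            parent[ry] = rx

    clusters = defaultdict(list)
    for keyword in list(parent):
        clusters[find(keyword)].append(keyword)
    return sorted(sorted(c) for c in clusters.values())


# Toy input, invented purely for illustration.
toy = [("car", "auto", 0.9), ("auto", "vehicle", 0.6),
       ("loan", "mortgage", 0.8), ("car", "loan", 0.1)]
print(cut_single_linkage(toy, 0.5))
# [['auto', 'car', 'vehicle'], ['loan', 'mortgage']]
```

Memory and time depend on k (the number of rows), not on n x n, which is the
whole point of the sparse input.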
Thank you,
Vincent
Follow up at: http://tinyurl.com/y8wswk7