Friday, October 23, 2009

How to evaluate the quality of clustering?

For the last 3 days I have been working on how to evaluate clustering performance (or quality). The reason is that I need to determine the number of prototype blocks ("good" example visual blocks) for each category (e.g. nervous, muscle, etc.), given a set of blocks for it. For this, I have the similarity matrix of combined visual features (a linear combination using weights found by the kernel alignment method) of the image blocks.
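
To make the setup concrete, here is a toy Matlab sketch of that combination (the matrices and weights below are only illustrative, not the real ones):

K1 = [1.0 0.8; 0.8 1.0];   % toy similarity matrix for one visual feature
K2 = [1.0 0.4; 0.4 1.0];   % toy similarity matrix for another feature
w  = [0.7 0.3];            % illustrative weights (e.g. from kernel alignment)
S  = w(1)*K1 + w(2)*K2;    % combined similarity matrix of the blocks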

I am using the k-centers (also called k-medoids) method to find the k image blocks that are a good representation of the visual variability in each concept. The problem is: what value of k is appropriate for selecting the representative blocks according to the visual variability within the set of image blocks for a given concept?

One of the most popular methods to select the right value of k is the silhouette coefficient.

The silhouette coefficient is calculated as follows:

For a given point i in a cluster A, the silhouette of i, s(i), is defined as follows:

s(i) = [ b(i) - a(i) ] / max { a(i) , b(i) }

where a(i) is the average dissimilarity between point i and all other points in A (the cluster to which i belongs), and b(i) is the average dissimilarity between point i and the points in the cluster closest to A, say B. Note that -1 <= s(i) <= 1. That means:

s(i) close to 1: the object i is well classified
s(i) close to 0: the object i lies between two clusters
s(i) close to -1: the object i is badly classified


The average of all silhouettes in the data set, S', is called the average silhouette width for all points in the data set. For a clustering with k clusters this value is denoted S'(k), and it is used to select the right number of clusters k by choosing the k for which S'(k) is as high as possible. The Silhouette Coefficient (SC) is then defined as follows:

SC = max { S'(k) }
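
To make the definition concrete, here is a minimal Matlab sketch (this is not the downloadable code mentioned below; variable names are illustrative) that computes the average silhouette width S'(k) from a dissimilarity matrix D and a vector of cluster labels, assuming every cluster has at least two points:

function Sk = avg_silhouette(D, labels)
    % D: n x n dissimilarity matrix, labels: n x 1 cluster assignments
    n  = length(labels);
    s  = zeros(n, 1);
    ks = unique(labels(:))';             % cluster labels as a row vector
    for i = 1:n
        own = (labels == labels(i));
        own(i) = false;                  % exclude the point itself
        a = mean(D(i, own));             % avg dissimilarity to its own cluster
        b = Inf;
        for c = ks
            if c ~= labels(i)
                b = min(b, mean(D(i, labels == c)));   % closest other cluster
            end
        end
        s(i) = (b - a) / max(a, b);
    end
    Sk = mean(s);                        % average silhouette width S'(k)
end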

Partial solution
First, I had to adapt the silhouette coefficient method to calculate it from a similarity matrix. The method was developed in Matlab and can be downloaded here.
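
For example, assuming similarities in [0, 1], the conversion to a dissimilarity matrix can be as simple as D = 1 - S (another common choice is D = max(S(:)) - S). A toy usage example with the sketch above:

S = [1.0 0.9 0.2 0.1;
     0.9 1.0 0.3 0.2;
     0.2 0.3 1.0 0.8;
     0.1 0.2 0.8 1.0];                 % toy 4 x 4 similarity matrix
D = 1 - S;                             % similarity -> dissimilarity (assumed)
D(1:size(D,1)+1:end) = 0;              % force zero self-dissimilarity
Sk = avg_silhouette(D, [1; 1; 2; 2]);  % S'(k) for a 2-cluster assignment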

I did several experiments varying the value of k for a specific concept, but I found that when k increases, the SC value increases as well.

The reason I found is that when a cluster has only one object, the a(i) value in the s(i) formula is NaN, because there are no other objects in the same cluster. This happens when using the following formulas:

SumDist := sum( distances(i,j) | j ~= i, for all j belonging to Ai )
NObjs := count( j | j ~= i, for all j belonging to Ai )

where i is the object, which belongs to cluster Ai, and j ranges over the other objects in Ai. Then I calculate a(i) as:

a(i) = SumDist / NObjs

Then, the question is: what must we do when NObjs is zero? In other words, how do we take into account the case when a cluster has just one element? How can the silhouette measure be improved to handle this? Is it possible to include in the measure a penalty when the number of objects per cluster is smaller than a given value?
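
One common convention (it is the one used in Kaufman and Rousseeuw's formulation) is to treat such an object as a borderline point and define s(i) = 0 when i is the only element of its cluster. A sketch of that guard, with the quantities defined above (b_i being the average dissimilarity to the closest other cluster):

if NObjs == 0
    s_i = 0;                            % singleton cluster: borderline by convention
else
    a_i = SumDist / NObjs;
    s_i = (b_i - a_i) / max(a_i, b_i);
end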

Now my idea is to vary the value of k until a cluster with just one element appears. In other words, maximize the SC value while minimizing the number of clusters with only one element.
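
A rough Matlab sketch of this selection (kmedoids_from_D is a hypothetical stand-in for the clustering routine that takes the dissimilarity matrix, avg_silhouette is the sketch above, and kmax is an assumed upper bound for k):

kmax  = 15;                               % assumed upper bound on k
bestK = NaN;
bestS = -Inf;
for k = 2:kmax
    labels = kmedoids_from_D(D, k);           % hypothetical k-medoids on dissimilarities
    sizes  = histc(labels, unique(labels));   % number of objects per cluster
    if any(sizes < 2)
        continue;                             % dismiss k: a one-element cluster appeared
    end
    Sk = avg_silhouette(D, labels);
    if Sk > bestS
        bestS = Sk;
        bestK = k;
    end
end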

I am going to try some experiments with this idea...

If you have any comments, if I'm doing something wrong, or if you have any idea for finding a good value of k in an objective way, your comments will be welcome.

Meanwhile, I will keep testing so that I can finally show the results of these experiments.

P.S.: Solving this problem could also be useful for summarizing and 2D visualization of a large collection of images. I believe...

References
Kaufman, L. and P. Rousseeuw, 1990. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, London. ISBN-10: 0471878766.

2 comments:

  1. I studied some of these issues some time ago and I have some notes about it. If you want you can pass by the lab and I will share them with you. We could discuss too :P

    Meanwhile, I would say that a(i) might be set to 0 in the case that i is the only element in the cluster A, following the idea that a(i) is intended to measure how dissimilar i is to the cluster it belongs to.

    On the other hand, I think it would be helpful to check whether the algorithm is finding a meaningful pattern in the data by comparing the SC values you are obtaining with the values obtained on random data. I'm not sure whether that is possible in your case, as I see you are using a kernel method: that was a question that remained open when I studied the topic.

  2. The general question of the post is a difficult one, and I think it is not yet satisfactorily solved. You can check http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf (chapter 8 of Tan's book) for some current methods. I think Angel's question is actually more pragmatic: how to determine a good value for the number of clusters?

    The general approach is OK: you try different values of k and apply an objective measure such as the silhouette coefficient. I would suggest not spending much time on this, since you don't need the exact number of clusters, just an approximate value.

    Regarding the question of what the value of s(i) should be for a cluster with only one element: s(i) = 0 makes sense, since it indicates that the object is a borderline point (not good but not bad). However, if you are getting clusters with only one element, it means that k is probably too high for the data set. So it is a good idea to dismiss these cases, as you propose.

    Also, it is important to remember that k-medoids, as well as k-means, is sensitive to the initial conditions, so you want to run several experiments for the same k value and calculate the average.

    How many data samples do you have? How do you initialize the algorithm? It would be interesting to see the kind of plots that you are getting.
