Sunday, October 25, 2009

Designing the poster for CIARP

Hi buddies, in a few days I hope to be in Guadalajara presenting a poster about the work that was accepted at the 14th Iberoamerican Congress on Pattern Recognition. For that reason I have to prepare the poster, and I have started to design it. Here I show the preliminary design; please give me your opinion so I can improve it. Thank you very much.
The image is also available on Flickr here.




Friday, October 23, 2009

How to evaluate the quality of clustering?

For the last 3 days I have been working on how to evaluate clustering performance (or quality). The reason is that I need to determine the number of prototype blocks ("good" example visual blocks) given a set of them for each category (e.g. nervous, muscle, etc.). For this, I have the similarity matrix of combined visual features (a linear combination using weights found by the kernel alignment method) of the image blocks.

I am using the k-centers (also called k-medoids) method to find the k image blocks that can be a good representation of the visual variability in each concept. The problem is: what value of k is appropriate for selecting the representative blocks according to the visual variability inside the set of image blocks for a given concept?
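To make the setup concrete, here is a minimal k-medoids sketch that works directly from a dissimilarity matrix, like the one I have (this is a toy illustration, not my actual Matlab code; the 1-D "blocks" are hypothetical stand-ins for real feature vectors):

```python
import numpy as np

def k_medoids(D, k, iters=50, seed=0):
    """Basic k-medoids on a dissimilarity matrix D: alternate assignment
    and medoid update until the medoid set stops changing."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, k, replace=False)
    for _ in range(iters):
        assign = D[:, medoids].argmin(axis=1)  # nearest medoid per point
        new = medoids.copy()
        for j in range(k):
            members = np.where(assign == j)[0]
            if len(members):
                # new medoid: member minimizing total dissimilarity to its cluster
                new[j] = members[D[np.ix_(members, members)].sum(axis=1).argmin()]
        if np.array_equal(new, medoids):
            break
        medoids = new
    return medoids, D[:, medoids].argmin(axis=1)

# toy data: six 1-D "blocks" forming two tight groups
x = np.array([0.0, 0.2, 0.1, 5.0, 5.2, 5.1])
D = np.abs(x[:, None] - x[None, :])  # dissimilarity matrix
medoids, labels = k_medoids(D, k=2)
print(sorted(x[medoids]))  # the two most central blocks: [0.1, 5.1]
```

The medoids returned here play the role of the representative prototype blocks for a concept.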

One of the most popular methods to select the right value of k is by means of the silhouette coefficient.

The method to calculate the silhouette coefficient is:

For a given point i in a cluster A, the silhouette of i, s(i), is defined as follows:

s(i) = [ b(i) - a(i) ] / max { a(i) , b(i) }

where a(i) is the average dissimilarity between point i and all other points in A (the cluster to which i belongs), and b(i) is the average dissimilarity between point i and the points in the cluster closest to A, call it B. Note that -1 <= s(i) <= 1. That means:

s(i) close to 1: the object i is well classified
s(i) close to 0: the object i lies between two clusters
s(i) close to -1: the object i is misclassified


The average of all silhouettes in the data set, S', is called the average silhouette width for all points in the data set. This value will be denoted by S'(k), and it is used to select the right number of clusters k: choose the k for which S'(k) is as high as possible. The Silhouette Coefficient (SC) is then defined as follows:

SC = max_k { S'(k) }
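The definitions above can be sketched in a few lines of Python/NumPy (a minimal illustration working from a dissimilarity matrix; the tiny 1-D data set and its labels are hypothetical inputs):

```python
import numpy as np

def silhouette_scores(D, labels):
    """s(i) for every point, given a dissimilarity matrix D and cluster labels."""
    n = len(labels)
    s = np.zeros(n)
    clusters = np.unique(labels)
    for i in range(n):
        own = labels[i]
        mask = (labels == own)
        mask[i] = False  # exclude i itself from its own cluster average
        a = D[i, mask].mean()  # a(i): average dissimilarity inside cluster A
        # b(i): smallest average dissimilarity to any other cluster
        b = min(D[i, labels == c].mean() for c in clusters if c != own)
        s[i] = (b - a) / max(a, b)
    return s

# toy example: two well-separated 1-D clusters
x = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
D = np.abs(x[:, None] - x[None, :])         # dissimilarity matrix
labels = np.array([0, 0, 0, 1, 1, 1])
print(silhouette_scores(D, labels).mean())  # S'(2), close to 1 here
```

S'(k) is just the mean of these per-point values; SC comes from repeating this over several k and taking the maximum.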

Partial solution
First, I had to adapt the silhouette coefficient method to compute it from a similarity matrix. The method was developed in Matlab and can be downloaded here.

I did several experiments varying the value of k for a specific concept, but I found that when k increases, the SC value also increases.

The reason I found is that when a cluster has only one object, the a(i) term of the s(i) formula is NaN, because there are no other objects in the same cluster. This happens when using the following formulas:

SumDist := sum( distance(i,j) | j ~= i, for all j belonging to Ai )
NObjs := count( j | j ~= i, for all j belonging to Ai )

where i is the object, which belongs to cluster Ai, and j ranges over the other objects. Then I calculate a(i) as:

a(i) = SumDist / NObjs

Then the questions are: what must we do when NObjs is zero? That is, how do we take into account a cluster with just one element? How can the silhouette measure be improved using this? Is it possible to include some measure that penalizes clusters with fewer objects than a given value?
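For what it's worth, one common convention (it appears in Kaufman and Rousseeuw's book, cited below) is to define s(i) = 0 when i is the only object in its cluster, so singletons neither help nor hurt S'(k). A sketch of that guard, on toy data:

```python
import numpy as np

def safe_silhouette(D, labels):
    """Silhouette per point; singleton clusters get s(i) = 0 by convention."""
    n = len(labels)
    s = np.zeros(n)
    clusters = np.unique(labels)
    for i in range(n):
        own_mask = (labels == labels[i])
        if own_mask.sum() == 1:
            s[i] = 0.0  # lone object: a(i) is undefined, use s(i) = 0
            continue
        own_mask[i] = False
        a = D[i, own_mask].mean()
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

x = np.array([0.0, 0.1, 9.0])
D = np.abs(x[:, None] - x[None, :])
print(safe_silhouette(D, np.array([0, 0, 1])))  # third entry is 0.0
```

This avoids the NaN, although it does not by itself penalize solutions with many singleton clusters.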

For now my idea is to vary the value of k until a cluster with just one element appears. In other words, maximize the SC value while minimizing the number of clusters with one element.

I am going to try some experiments with this idea...

If you have any comments, if I'm doing something wrong, or if you have any idea on how to find a good value of k in an objective way, your comments will be welcome.

Meanwhile, I will keep testing so I can finally show the results of these experiments.

P.S.: Solving this problem could also be useful for summarizing and 2D-visualizing a large collection of images. I believe...

References
Kaufman, L. and Rousseeuw, P. J., 1990. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, London. ISBN-10: 0471878766.

Wednesday, October 14, 2009

Sparse representations

A "recent" paradigm for representing digital images, previously used for signals, promises to be the holy grail of computer vision thanks to its striking results. So the question is: why not study it?

The first motivation is that there is a possible relation with my master's thesis.

In bag of features, we split the images into blocks, commonly called visual words, and then a feature description is computed to represent these visual words. The process is performed on all the images in a specific image collection. Then a visual codebook is built from the most representative visual words in the collection; a common approach is clustering, e.g. k-means. Once the visual codebook, or dictionary, is built, each image is represented by the occurrences of codebook words in that image; the assignment is made by finding, for each visual word in the image, the most similar visual word in the codebook.
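The pipeline described above can be sketched as follows (a toy Python/NumPy sketch; the random descriptors are hypothetical placeholders for real block features, and the plain k-means here stands in for whatever clustering builds the codebook):

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=20):
    """Plain k-means to build the visual codebook from block descriptors."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        assign = d.argmin(axis=1)  # nearest codeword for each descriptor
        for j in range(k):
            if (assign == j).any():
                centers[j] = X[assign == j].mean(axis=0)
    return centers

def bow_histogram(blocks, codebook):
    """Assign each image block to its nearest codeword and count occurrences."""
    d = np.linalg.norm(blocks[:, None] - codebook[None], axis=2)
    words = d.argmin(axis=1)
    return np.bincount(words, minlength=len(codebook))

# toy data: 200 block descriptors from the whole collection, 8-D features
descriptors = rng.normal(size=(200, 8))
codebook = kmeans(descriptors, k=16)
image_blocks = rng.normal(size=(30, 8))  # the blocks of one image
h = bow_histogram(image_blocks, codebook)
print(h.sum())  # 30: every block is assigned to exactly one codeword
```

The resulting histogram h is the bag-of-features representation of that one image.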

In sparse representation, we choose random blocks in an image, called the dictionary D. Then we want to obtain a vector x that helps to reconstruct the original image as a linear combination of them. The optimization problem is defined by a sparseness measure, the zero norm, and the best solution is the sparsest vector x. However, D is not a square matrix, and the problem is underdetermined: the number of observations (columns) is greater than the basis dimension (rows), so it has infinitely many solutions. The best solution is the x that is sparsest in the zero norm, but finding it is an NP-hard problem. The sparseness measure in the one norm is a good approximation, and in some cases gives the same solution as the original optimization problem, with the advantage that it can be solved with linear programming (basis pursuit) or with greedy methods (matching pursuit, orthogonal matching pursuit, among others). With sparse solutions, the dictionary is the best set of basis vectors for representing the image content, and the representation is more compact than with Fourier, wavelets, curvelets, etc.
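As a sketch of the greedy route, here is a minimal orthogonal matching pursuit (a toy illustration with a synthetic overcomplete dictionary and a synthetic sparse signal, not the actual vision pipeline):

```python
import numpy as np

def omp(D, y, n_nonzero):
    """Orthogonal matching pursuit: greedily pick atoms of D to reconstruct y."""
    residual = y.copy()
    support = []
    x = np.zeros(D.shape[1])
    for _ in range(n_nonzero):
        # pick the atom most correlated with the current residual
        j = int(np.argmax(np.abs(D.T @ residual)))
        support.append(j)
        # least-squares fit of y on the atoms selected so far
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x[support] = coef
    return x

rng = np.random.default_rng(1)
D = rng.normal(size=(20, 50))           # overcomplete: 50 atoms, 20 dimensions
D /= np.linalg.norm(D, axis=0)          # unit-norm atoms
x_true = np.zeros(50)
x_true[[3, 17, 41]] = [1.0, -2.0, 0.5]  # a 3-sparse signal
y = D @ x_true
x_hat = omp(D, y, n_nonzero=3)
print(np.linalg.norm(y - D @ x_hat))    # reconstruction residual
```

With a random Gaussian dictionary this size, a 3-sparse signal is usually (though not always) recovered exactly; the point is just to show the select-fit-update loop.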

At the moment I am working on this approach and on how it can help me in my master's thesis... I hope :)