Our aim is to develop a sound, concrete solution for fitting a very large volume of unsorted data into a virtual hierarchy, so that extracting relevant data becomes more feasible.
 
We recently researched which clustering algorithm best meets the requirements of the project. We are currently preparing the system design documentation.
This is our proposed solution:

Placing the vast volume of available unsorted raw data in a meaningful hierarchy could bring a significant advantage to data mining and information retrieval. Text clustering can provide this functionality: it divides a given set of text documents into a number of subsets based on their contextual similarity.

The first step of text clustering is to assign a vector to each document, so that documents can be compared and their similarities identified. To obtain the document vector, the document is first filtered to remove stop words such as articles, conjunctions and prepositions, which carry no content information. A thesaurus could help here by mapping synonyms to a common term, which improves the content information gathered. Next, based on a statistical analysis of the input document set, a set of key words (index words) is chosen. The vector for each document is then generated by counting the number of occurrences of each index word in that document. Finally, these document vectors are fed to an unsupervised clustering algorithm such as Self Organizing Maps, Growing Self Organizing Maps or k-means, which compares the documents and produces a set of clusters, each containing similar documents.
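The vector-building steps described above can be sketched roughly as follows. This is only an illustration under several assumptions, not our actual implementation: the stop-word list is a tiny hypothetical one, index words are chosen simply by how many documents they appear in (one possible "statistical analysis"), and the names tokenize, choose_index_words and document_vectors are made up for this example. Thesaurus support is left out.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "and", "or", "but", "of", "in", "on",
              "to", "is", "are", "for", "with"}

def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def choose_index_words(token_lists, size):
    """Pick the `size` words found in the most documents (ties broken alphabetically)."""
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each word once per document
    ranked = sorted(doc_freq, key=lambda w: (-doc_freq[w], w))
    return ranked[:size]

def document_vectors(docs, size=5):
    """Return the index words and one occurrence-count vector per document."""
    token_lists = [tokenize(d) for d in docs]
    index_words = choose_index_words(token_lists, size)
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append([counts[w] for w in index_words])
    return index_words, vectors

docs = [
    "The cat chased the mouse and the cat slept",
    "A mouse is in the house",
    "Stocks and bonds are in the market",
]
index_words, vectors = document_vectors(docs, size=4)
print(index_words)  # ['mouse', 'bonds', 'cat', 'chased']
print(vectors)      # [[1, 0, 2, 1], [1, 0, 0, 0], [0, 1, 0, 0]]
```

Each position in a vector corresponds to one index word, so two documents that share many index words end up with vectors that are close together, which is what the clustering algorithm relies on.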

Applying the above steps again to a single cluster of documents yields a refined set of sub-clusters of that cluster. Repeating this procedure on each resulting cluster produces a meaningful hierarchy from the initial set of unsorted text documents.
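The repeated splitting could look something like the sketch below. It assumes k-means as the clustering algorithm (one of the candidates mentioned above), uses a simple deterministic initialisation, and for brevity reuses the same document vectors at every level instead of choosing new index words for each cluster. The function names and the parameters k, min_size and max_depth are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def kmeans(vectors, k, iterations=20):
    """Return one cluster label per vector. Initial centroids: first k distinct vectors."""
    centroids = []
    for v in vectors:
        if list(v) not in centroids:
            centroids.append(list(v))
        if len(centroids) == k:
            break
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda c: squared_distance(v, centroids[c]))
                  for v in vectors]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = mean(members)
    return labels

def build_hierarchy(ids, vectors, k=2, min_size=2, depth=0, max_depth=3):
    """Recursively split documents into nested lists of document ids."""
    if len(ids) <= min_size or depth >= max_depth:
        return list(ids)
    labels = kmeans(vectors, k)
    groups = {}
    for doc_id, vec, label in zip(ids, vectors, labels):
        group_ids, group_vecs = groups.setdefault(label, ([], []))
        group_ids.append(doc_id)
        group_vecs.append(vec)
    if len(groups) < 2:  # the cluster could not be split any further
        return list(ids)
    return [build_hierarchy(g_ids, g_vecs, k, min_size, depth + 1, max_depth)
            for _, (g_ids, g_vecs) in sorted(groups.items())]

ids = ["d0", "d1", "d2", "d3", "d4", "d5"]
vectors = [[0, 0], [0, 1], [10, 10], [10, 11], [20, 0], [21, 0]]
hierarchy = build_hierarchy(ids, vectors)
print(hierarchy)
```

The result is a nested list: inner lists are leaf clusters of document ids, and each level of nesting corresponds to one more round of clustering.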