Organizing the vast volume of available unsorted raw data into a meaningful hierarchy can greatly benefit data mining and information retrieval. Text clustering is one technique for achieving this: it divides a given set of text documents into a number of subsets based on their contextual similarity.
The first step of text clustering is to assign a vector to each document so that documents can be compared and their similarities identified. To obtain the document vector, the document is first filtered to remove stop words, such as articles, conjunctions, and prepositions, which carry no content information. Thesauri can help by identifying synonyms, which improves the content information gathered. Then, based on a statistical analysis of the input document set, a set of keywords (index words) is chosen. The document vector for each document is generated by counting the occurrences of each index word in that document. These document vectors are then fed to an unsupervised clustering algorithm, such as Self-Organizing Maps, Growing Self-Organizing Maps, or k-means, which compares the documents and produces a set of clusters, each containing similar documents.
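
The pipeline described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the tiny stop-word list, the rule of keeping words that appear in at least two documents (a stand-in for the statistical analysis, which the text does not specify), the four sample sentences, and all function names. Of the clustering algorithms mentioned, it uses a plain k-means, seeded with the first k vectors so the result is repeatable.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are far longer).
STOP_WORDS = {"a", "an", "the", "and", "or", "but", "of", "in", "on", "to",
              "is", "are", "for", "with"}


def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def choose_index_words(token_lists, min_docs=2):
    """Keep words occurring in at least `min_docs` documents (assumed criterion)."""
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each word once per document
    return sorted(w for w, n in doc_freq.items() if n >= min_docs)


def document_vector(tokens, index_words):
    """Count how often each index word occurs in one document."""
    counts = Counter(tokens)
    return [counts[w] for w in index_words]


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def k_means(vectors, k, iterations=10):
    """Plain k-means, seeded with the first k vectors for determinism."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(v, centroids[c]))
                  for v in vectors]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical sample documents.
documents = [
    "The cat chased the mouse in the garden",
    "Stock markets fell and bond markets rose",
    "A cat and a mouse played in the garden",
    "Investors sold stock as bond markets rose",
]

token_lists = [tokenize(d) for d in documents]
index_words = choose_index_words(token_lists)
vectors = [document_vector(t, index_words) for t in token_lists]
labels = k_means(vectors, k=2)

print(index_words)
print(labels)  # [0, 1, 0, 1]
```

With these sample sentences, the two documents about animals fall into one cluster and the two about markets into the other. In practice, raw counts are often replaced by weighted values and the stop-word list and index-word criterion are tuned to the corpus; the sketch keeps raw occurrence counts because that is what the description above specifies.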