ISSN : 1796-203X
Volume : 4    Issue : 3    Date : March 2009

An Improved KNN Text Classification Algorithm Based on Clustering
Yong Zhou, Youwen Li, and Shixiong Xia
Page(s): 230-237

The traditional KNN text classification algorithm uses all training samples for classification, so it
must process a huge number of training samples at a high computational cost, and it does not
reflect the differing importance of individual samples. To address these problems, this paper
proposes an improved KNN text classification algorithm based on cluster centers. First, the given
training sets are compressed and the samples near the border are deleted, which eliminates the
multipeak effect of the training sample sets. Second, the training sample sets of each category
are clustered by the k-means clustering algorithm, and all cluster centers are taken as the new
training samples. Third, a weight is introduced for each new training sample, indicating its
importance according to the number of samples in the cluster that this cluster center represents.
Finally, the modified samples are used to perform KNN text classification. Simulation results
show that the proposed algorithm not only effectively reduces the actual number of training
samples and lowers the computational complexity, but also improves the accuracy of the KNN text
classification algorithm.
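
The clustering, weighting, and classification steps described in the abstract can be illustrated with
a minimal sketch. It is not the authors' implementation. The abstract does not specify the distance
measure, the k-means initialization, or the exact weight formula, so the sketch assumes Euclidean
distance, takes the first k points of each category as initial centers, and uses the raw cluster size
as the vote weight. The border-sample deletion step is omitted because the abstract gives no
procedure for it. All function names are hypothetical.

```python
import math
from collections import defaultdict

def dist(a, b):
    # Euclidean distance (assumption; the abstract does not name a measure).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def assign(points, centers):
    # Put every point into the cluster of its nearest center.
    clusters = [[] for _ in centers]
    for p in points:
        i = min(range(len(centers)), key=lambda j: dist(p, centers[j]))
        clusters[i].append(p)
    return clusters

def kmeans(points, k, iters=20):
    # Plain Lloyd iteration; initial centers are the first k points (assumption).
    centers = [tuple(p) for p in points[:k]]
    for _ in range(iters):
        clusters = assign(points, centers)
        updated = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    return centers, assign(points, centers)

def build_reduced_set(samples_by_label, k_per_class):
    # Cluster each category separately; keep (center, label, weight) triples,
    # where weight is the number of samples in the center's cluster.
    reduced = []
    for label, points in samples_by_label.items():
        centers, clusters = kmeans(points, min(k_per_class, len(points)))
        for center, members in zip(centers, clusters):
            if members:
                reduced.append((center, label, len(members)))
    return reduced

def classify(x, reduced, k=3):
    # KNN over cluster centers; each neighbor votes with its weight (assumption).
    nearest = sorted(reduced, key=lambda r: dist(x, r[0]))[:k]
    votes = defaultdict(float)
    for _center, label, weight in nearest:
        votes[label] += weight
    return max(votes, key=votes.get)

# Toy 2-D vectors standing in for document feature vectors.
train = {
    "a": [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)],
    "b": [(8.0, 8.0), (8.0, 9.0), (9.0, 8.0), (9.0, 9.0)],
}
reduced = build_reduced_set(train, k_per_class=2)
print(len(reduced), classify((0.5, 0.5), reduced))
```

In this sketch the classifier compares a query against the cluster centers only, so the number of
distance computations depends on the number of clusters rather than on the size of the original
training set, which is the source of the reduction in computation that the abstract describes.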

Index Terms
Text classification, KNN algorithm, sample austerity, cluster