Proceedings of 2009 International Symposium on Computer Science and Computational Technology (ISCSCT 2009)

Huangshan, China, December 26-28, 2009

Editors: Fei Yu, Guangxue Yue, Jian Shu, Yun Liu

AP Catalog Number: AP-PROC-CS-09CN005

ISBN: 978-952-5726-07-7 (Print), 978-952-5726-08-4 (CD-ROM)

Page(s): 462-466

Suffix Tree Based Chinese Document Feature Extraction and Clustering in RSS Aggregator

Jian Wan, Wenming Yu, and Xianghua Xu

In an RSS aggregator, an important issue is how to make feed information more manageable for subscribers. In this paper, we propose a suffix tree based method for clustering RSS feed documents in a Chinese RSS aggregator. We construct a suffix tree from meaningful Chinese words and select as document features the phrases that score highly under a scoring formula. We then cluster the documents using the group-average algorithm with a new document similarity measure. Experimental results show that the new method improves clustering quality in the document “snippets” scenario, and that its speed meets the demands of “on the fly” clustering.

suffix tree, feature extraction, document clustering, RSS
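
The pipeline summarized in the abstract (shared phrases as document features, then group-average clustering) can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: documents are already segmented into words, shared phrases are found by brute-force n-gram enumeration rather than by an actual suffix tree, every shared phrase is kept instead of being ranked by the paper's scoring formula, and Jaccard overlap stands in for the paper's new similarity measure. The function names, `max_len`, and `threshold` are hypothetical.

```python
from itertools import combinations


def doc_features(docs, max_len=3):
    """Return, for each document, the set of word n-grams it shares with
    at least one other document. This is a brute-force stand-in for
    reading shared phrases off the internal nodes of a suffix tree."""
    index = {}
    for i, words in enumerate(docs):
        for n in range(1, max_len + 1):
            for start in range(len(words) - n + 1):
                phrase = tuple(words[start:start + n])
                index.setdefault(phrase, set()).add(i)
    feats = [set() for _ in docs]
    for phrase, ids in index.items():
        if len(ids) >= 2:  # keep only phrases shared by two or more documents
            for i in ids:
                feats[i].add(phrase)
    return feats


def jaccard(a, b):
    """Assumed similarity: shared features over all features."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_average_clusters(feats, threshold=0.2):
    """Agglomerative clustering: repeatedly merge the two clusters with
    the highest average pairwise similarity between their members, and
    stop once the best average falls below the (assumed) threshold."""
    clusters = [[i] for i in range(len(feats))]

    def avg_sim(c1, c2):
        total = sum(jaccard(feats[i], feats[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda pair: avg_sim(clusters[pair[0]], clusters[pair[1]]),
        )
        if avg_sim(clusters[a], clusters[b]) < threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return sorted(sorted(c) for c in clusters)


if __name__ == "__main__":
    # Pre-segmented toy "snippets": two about the stock market, two about a match.
    docs = [
        ["股市", "大幅", "上涨"],
        ["股市", "大幅", "下跌"],
        ["球队", "赢得", "比赛"],
        ["球队", "输掉", "比赛"],
    ]
    print(group_average_clusters(doc_features(docs)))  # [[0, 1], [2, 3]]
```

The quadratic search over cluster pairs keeps the sketch short; it says nothing about how the paper achieves “on the fly” speed.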

