

Proceedings of 2009 International Symposium on Computer Science and Computational Technology (ISCSCT 2009)

Huangshan, China, December 26-28, 2009

Editors: Fei Yu, Guangxue Yue, Jian Shu, Yun Liu

AP Catalog Number: AP-PROC-CS-09CN005

ISBN: 978-952-5726-07-7 (Print), 978-952-5726-08-4 (CD-ROM)

Page(s): 278-280

Design and Implement of Distributed Document Clustering Based on MapReduce

Jian Wan, Wenming Yu, and Xianghua Xu



In this paper, we describe how document clustering for large collections can be implemented efficiently with MapReduce. The Hadoop implementation of MapReduce provides a convenient and flexible framework for distributed computing on a cluster of commodity machines. We present the design and implementation of the tfidf and K-Means algorithms on MapReduce. More importantly, we improve the efficiency and effectiveness of the algorithm. Finally, we give results and some related discussion.

Index Terms

MapReduce, tfidf, K-Means clustering
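
The pipeline named in the abstract (tfidf weighting followed by K-Means, each expressed as map and reduce steps) can be sketched roughly as follows. This is a minimal in-memory illustration, not the authors' implementation and not the Hadoop API: `run_mapreduce`, `tfidf`, `sq_dist`, and `kmeans_step` are hypothetical names, the weighting tf * log(N / df) is one common tfidf variant assumed here, and squared Euclidean distance is likewise an assumption, since the abstract does not specify either.

```python
import math
from collections import Counter, defaultdict


def run_mapreduce(records, mapper, reducer):
    """Simulate one MapReduce job in memory: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [out for key in sorted(groups) for out in reducer(key, groups[key])]


def tfidf(docs):
    """docs maps doc_id -> list of tokens; returns doc_id -> {term: weight}."""
    n_docs = len(docs)

    # Job 1: count the number of documents that contain each term.
    def df_map(item):
        _doc_id, tokens = item
        for term in set(tokens):
            yield term, 1

    def df_reduce(term, ones):
        yield term, sum(ones)

    df = dict(run_mapreduce(docs.items(), df_map, df_reduce))

    # Job 2 (map-only): weight each term as tf * log(N / df).
    # Assumed variant; the paper may use a different formula.
    vectors = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        vectors[doc_id] = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two sparse vectors (dicts)."""
    return sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in set(a) | set(b))


def kmeans_step(vectors, centroids):
    """One K-Means iteration: map assigns documents, reduce averages clusters."""

    def assign_map(item):
        _doc_id, vec = item
        nearest = min(range(len(centroids)),
                      key=lambda i: sq_dist(vec, centroids[i]))
        yield nearest, vec

    def mean_reduce(cluster, members):
        total = defaultdict(float)
        for vec in members:
            for term, weight in vec.items():
                total[term] += weight
        yield cluster, {t: w / len(members) for t, w in total.items()}

    updated = dict(run_mapreduce(vectors.items(), assign_map, mean_reduce))
    # A centroid that attracted no documents is kept unchanged.
    return [updated.get(i, centroids[i]) for i in range(len(centroids))]
```

In this sketch, a driver would call `kmeans_step` repeatedly, feeding each result back in as the next `centroids`, until the centroids stop changing or an iteration limit is reached. On a real cluster each call would correspond to a separate MapReduce job, with the current centroids distributed to every mapper.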

Copyright © 2009 ACADEMY PUBLISHER. All rights reserved.