Proceedings of 2009 International Symposium on Computer Science and Computational Technology (ISCSCT 2009)

Huangshan, China, December 26-28, 2009

Editors: Fei Yu, Guangxue Yue, Jian Shu, Yun Liu

AP Catalog Number: AP-PROC-CS-09CN005

ISBN: 978-952-5726-07-7 (Print), 978-952-5726-08-4 (CD-ROM)

Page(s): 20-25

SimHash-based Effective and Efficient Detecting of Near-Duplicate Short Messages

Bingfeng Pi, Shunkai Fu, Weilei Wang, and Song Han


Detecting near-duplicates within a huge repository of short messages is challenging because the messages are short, typos are frequent when typing on a mobile phone, the Chinese language is flexible and diverse, and the target is near-duplicates rather than exact duplicates. In this paper, we discuss a problem encountered in a real application and look for a suitable technique to solve it. We start by discussing how serious the near-duplicate problem is in short messages. Then we review how SimHash works and its possible merits for finding near-duplicates. Finally, we present a series of findings, covering both the problem itself and the benefits of the SimHash-based approach, based on experiments with 500 thousand real short messages crawled from the Internet. We believe the discussion is a valuable reference for both researchers and practitioners.
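
For readers unfamiliar with SimHash, the general technique can be sketched as follows. This is a minimal illustration and not the implementation evaluated in the paper: the use of character bigrams as features, unit feature weights, MD5 as the underlying hash, the 64-bit fingerprint width, and the function names `simhash` and `hamming_distance` are all assumptions made for the example.

```python
import hashlib


def simhash(text: str, bits: int = 64, ngram: int = 2) -> int:
    """Return a `bits`-bit SimHash fingerprint of `text`.

    Assumptions for this sketch: features are character n-grams (so no
    Chinese word segmenter is needed), every feature has weight 1, and
    `bits` is a multiple of 8 no larger than 128 (the MD5 digest size).
    """
    features = [text[i:i + ngram] for i in range(max(len(text) - ngram + 1, 1))]
    weights = [0] * bits
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        # Each feature votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    # A bit of the fingerprint is set when its votes sum to a positive value.
    fingerprint = 0
    for i in range(bits):
        if weights[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two messages would then be treated as near-duplicates when the Hamming distance between their fingerprints falls below a chosen threshold. The threshold value is application-specific and is not taken from this abstract.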

Index Terms

Near-duplicate, SimHash, short text

Copyright © 2009 ACADEMY PUBLISHER. All rights reserved.