Document Clustering in Amharic

Submitted by Guy on Fri, 2011-12-09 07:56

Title	Document Clustering in Amharic
Publication Type	Conference Paper
Year of Publication	2011
Authors	Abgaz, Yalemisew
Booktitle	AGIS11 - Action Week for Global Information Sharing (AfLaT2011 Breakout Session)
Location	Addis Ababa, Ethiopia
Abstract	The sheer volume of digital information produced in different languages has become a driving force for building efficient systems and tools to organize, store and retrieve information products. Among these languages, Amharic is one of the major African languages to have joined the electronic era, with a growing body of content and applications. At the same time, the need to search and retrieve existing information products in Amharic is rising. However, tools and techniques that satisfy this need remain unexplored. Taking this into consideration, designing information organization and retrieval systems for Amharic documents has become crucial. The main objective of this study is to determine whether document clustering improves the organization and retrieval performance of documents in Amharic. We collected 400 news articles published by Walta Information Centre, processed them (stop word removal, indexing, and stemming), and stored them as vectors with their corresponding term frequency and inverse document frequency. We used character mappings and shallow stemming algorithms to process the documents. We built a tool that clusters the documents using the frequent item set hierarchical clustering algorithm (FIHC) and generated a hierarchy of document clusters by tuning different parameters. A top-down cluster search mechanism was implemented to find the best-matching clusters for a query using the centroid vectors generated during the clustering process. We used 22 queries to measure the performance of the system in terms of recall and precision, and compared our results with relevance feedback collected from different categories of users. Hierarchical document clustering demonstrated an improvement over the performance of existing information retrieval systems and experiments designed for Amharic. The study further exhibited promising results in recall and precision, and the approach can be used as a complementary system for other retrieval systems designed for Amharic.
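
The retrieval steps named in the abstract (TF-IDF weighting, top-down search over cluster centroids, and evaluation by precision and recall) can be illustrated with a minimal Python sketch. This is not the authors' tool: the FIHC clustering step itself is not reproduced, the two-cluster hierarchy is built by hand, the English tokens are placeholders standing in for stop-word-filtered, stemmed Amharic terms, and all function names and the dictionary-based tree layout are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def tfidf_vectors(docs):
    """Turn token lists into sparse TF-IDF vectors (term -> weight)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [{t: c * math.log(n / df[t]) for t, c in Counter(doc).items()}
            for doc in docs]

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Average a list of sparse vectors into one centroid vector."""
    total = defaultdict(float)
    for vec in vectors:
        for t, w in vec.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}

def top_down_search(node, query):
    """Descend from the root, always following the most similar child centroid."""
    while node["children"]:
        node = max(node["children"], key=lambda c: cosine(query, c["centroid"]))
    return node

def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against relevance judgments."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical toy corpus; placeholder tokens, not data from the study.
docs = [["election", "vote"], ["election", "party"],
        ["football", "goal"], ["football", "match"]]
vecs = tfidf_vectors(docs)

# Hand-built hierarchy standing in for the output of a clustering step.
politics = {"docs": [0, 1], "centroid": centroid(vecs[:2]), "children": []}
sport = {"docs": [2, 3], "centroid": centroid(vecs[2:]), "children": []}
root = {"docs": [0, 1, 2, 3], "centroid": centroid(vecs),
        "children": [politics, sport]}

best = top_down_search(root, {"football": 1.0})
print(best["docs"], precision_recall(best["docs"], [2, 3]))
```

The query here is a plain term-to-weight dictionary in the same term space as the documents; a real system would pass the query through the same stop word removal and stemming before matching it against the centroids.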
