Document Clustering in Amharic

Title: Document Clustering in Amharic
Publication Type: Conference Paper
Year of Publication: 2011
Authors: Abgaz, Yalemisew
Book Title: AGIS11 - Action Week for Global Information Sharing (AfLaT2011 Breakout Session)
Location: Addis Ababa, Ethiopia
Abstract

The sheer volume of digital information produced in different languages has become a driving force for building efficient systems and tools to organize, store, and retrieve information products. Among these languages, Amharic is one of the major African languages that has joined the electronic era, with a growing body of content and applications. At the same time, the need to search and retrieve existing information products in Amharic is rising. However, tools and techniques that satisfy this need remain unexplored. Taking this into consideration, designing information organization and retrieval systems for Amharic documents has become crucial.
The main objective of this study is to understand whether document clustering improves information organization and retrieval performance for documents in Amharic. We collected 400 news articles published by Walta Information Centre, preprocessed them (stop word removal, indexing, and stemming), and stored them in vector form with their corresponding term frequency and inverse document frequency. We used character mappings and shallow stemming algorithms to process the documents. We built a tool that clusters the documents using the frequent itemset-based hierarchical clustering (FIHC) algorithm and generated a hierarchy of document clusters by tuning different parameters. A top-down cluster search mechanism is implemented to find the best-matching clusters for a query using the centroid vectors generated during the clustering process.
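As a rough illustration of the vector representation and top-down centroid search described above, the following is a minimal sketch, not the tool built in the study. It assumes documents are already tokenized, stop-word-filtered, and stemmed, that the cluster hierarchy has already been produced (FIHC itself is not implemented here), and that cosine similarity is the matching measure. All function names and the dictionary-based tree layout are hypothetical.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build sparse TF-IDF vectors from token lists (assumed preprocessed)."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def centroid(vectors):
    """Average a non-empty list of sparse vectors into one centroid vector."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}

def top_down_search(node, query):
    """Descend from the root, always following the child whose centroid
    is most similar to the query, and return the leaf cluster reached."""
    while node["children"]:
        node = max(node["children"], key=lambda c: cosine(c["centroid"], query))
    return node

# Hypothetical toy hierarchy: a root with two leaf clusters.
docs = [["a", "b"], ["a", "c"], ["d", "e"]]
vecs = tf_idf_vectors(docs)
leaf_1 = {"name": "cluster-1", "centroid": centroid(vecs[:2]), "children": []}
leaf_2 = {"name": "cluster-2", "centroid": centroid(vecs[2:]), "children": []}
root = {"name": "root", "centroid": centroid(vecs), "children": [leaf_1, leaf_2]}
best = top_down_search(root, {"d": 1.0})
```

A greedy descent like this visits only one branch per level, which keeps query cost low but can miss relevant documents in sibling clusters; whether the study's search follows a single branch or several is not stated in the abstract.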
We used 22 queries to measure the performance of the system in terms of recall and precision, and compared our results with relevance feedback collected from different categories of users. Hierarchical document clustering demonstrated an improvement over the performance of existing information retrieval systems and experiments designed for Amharic. The study also showed promising results in recall and precision, and the approach can be used as a complementary system for other retrieval systems designed for Amharic.
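For reference, recall and precision for a single query can be computed as in the sketch below. It uses the standard set-based definitions and assumes the relevant set comes from user relevance judgments; the function name and the sample document IDs are hypothetical and are not data from the study.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant retrieved / all retrieved
    recall    = relevant retrieved / all relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 4 documents retrieved, 3 judged relevant, 2 in common.
p, r = precision_recall([1, 2, 3, 4], [2, 4, 6])
```

Averaging these per-query values over all queries gives one common way to summarize system performance.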