Tagging and Verifying an Amharic News Corpus

Publication Type: Proceedings Article
Year of Conference: 2012
Authors: Gambäck, Björn
Conference Name: Proceedings of the workshop on Language technology for normalisation of less-resourced languages (SALTMIL8/AfLaT2012)
Publisher: European Language Resources Association (ELRA)
Conference Location: Istanbul, Turkey
ISBN Number: 978-2-9517408-7-7

The paper describes work on verifying, correcting and retagging a corpus of Amharic news texts. A total of 8715 Amharic news articles had previously been collected from a web site, and part of the corpus (1065 articles; 210,000 words) had then been morphologically analysed and manually part-of-speech tagged. The tagged corpus has been used as the basis for testing the application to Amharic of machine learning techniques and tools developed for other languages. This process made it possible to spot several errors and inconsistencies in the corpus, which has been iteratively refined, cleaned, normalised, split into folds, and partially retagged by both automatic and manual means.

