Finite state tokenisation of an orthographical disjunctive agglutinative language: The verbal segment of Northern Sotho

Title: Finite state tokenisation of an orthographical disjunctive agglutinative language: The verbal segment of Northern Sotho
Publication Type: Proceedings Article
Year of Conference: 2006
Authors: Anderson, Winston, and Kotzé, Petronella M.
Conference Name: Fifth International Conference on Language Resources and Evaluation
Pagination: 1906-1911
Conference Start Date: 24/05/2006
Publisher: European Language Resources Association
Conference Location: Genoa, Italy
Keywords: Northern Sotho, tokenisation
Abstract

Tokenisation is an important first pre-processing step, required before finite-state morphological analysers can be adequately tested. In agglutinative languages, morphemes are added concatenatively to form a complete morphological structure. Disjunctive agglutinative languages such as Northern Sotho write these morphemes, for certain morphological categories only, as separate words divided by spaces or line breaks. These breaks differ in nature from the breaks that separate "words" written conjunctively. A tokeniser is therefore required to isolate categories, such as the verb, from raw text before they can be morphologically analysed correctly. The authors have successfully produced a finite-state tokeniser for Northern Sotho, in which verbal segments are written disjunctively but nominal segments conjunctively. They show that, since reduplication in Northern Sotho does not affect the pre-processing tokeniser, the disjunctive standard verbal segment as a construct in Northern Sotho is deterministic, finite-state and a regular (Type 3) language in the Chomsky hierarchy, and that the copulative verbal segment, due to its semi-disjunctivism, is ambiguously non-deterministic.