The Ukwabelana corpus - An annotated isiZulu corpus


  • The corpus contains 10,000 morphologically labeled words and 3,000 POS-tagged sentences.
  • The corpus comprises around 100,000 common Zulu word types and 30,000 Zulu sentences compiled from fictional works and the Zulu Bible, from which the labeled words and sentences have been sampled.
  • All software and additional data used during the annotation process are provided: the partial grammar in DCG format, the abductive algorithm for parsing with incomplete information, and a prototype POS tagger that assigns word categories to morphologically analyzed words.
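
To make the last item concrete, the following is a minimal sketch of what assigning a word category to a morphologically analyzed word could look like. It is not the corpus's actual tagger: the morpheme labels (noun_prefix, verb_root, and so on), the category names, and the function assign_category are all hypothetical, chosen only for illustration.

```python
from typing import List, Tuple

# A morphological analysis is an ordered list of (morpheme, label) pairs.
# The labels below are hypothetical and are NOT the corpus's real tag set.
Analysis = List[Tuple[str, str]]


def assign_category(analysis: Analysis) -> str:
    """Pick a coarse word category from the morpheme labels (toy rules)."""
    labels = {label for _, label in analysis}
    if "verb_root" in labels:
        return "VERB"
    if "noun_stem" in labels:
        return "NOUN"
    return "UNKNOWN"


# umuntu "person": class prefix umu- plus stem -ntu
umuntu: Analysis = [("umu", "noun_prefix"), ("ntu", "noun_stem")]

# ngiyahamba "I am going": ngi- (subject concord), -ya- (present tense
# marker), -hamb- (verb root), -a (final vowel)
ngiyahamba: Analysis = [
    ("ngi", "subject_concord"),
    ("ya", "tense_marker"),
    ("hamb", "verb_root"),
    ("a", "final_vowel"),
]

for word in (umuntu, ngiyahamba):
    surface = "".join(morpheme for morpheme, _ in word)
    print(surface, assign_category(word))
```

A real tagger would need a much richer label inventory and would have to deal with ambiguous or incomplete analyses; the sketch only shows the input and output shape of the task.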