Eastern Africa


Helsinki Corpus of Swahili


The Helsinki Corpus of Swahili (HCS) contains 12.5 million words of text from a number of current news sources, as well as extracts from a large number of books. Typing errors in the texts have been corrected manually. The corpus was tagged with SALAMA without human intervention. The corpus is available free of charge for scientific research, subject to a signed contract.
The corpus can be accessed through the web-based browser Lemmie 2.0. Direct access to the Linux server is also possible. The English glosses cannot currently be accessed through Lemmie 2.0, so users who need them may wish to use the Linux interface.
Currently HCS does not have syntactic tags. In the future we wish to enrich the corpus with those tags, together with a number of new features, including a large number of idioms and multi-word expressions. New texts will also be added.

SALAMA - Swahili Language Manager


SALAMA is a rule-based system for managing a number of applications on Swahili. It includes a tokenizer, morphological analyzer, morphological disambiguator, semantic disambiguator, syntactic analyzer, and a text-based machine translation system from Swahili to English.
For morphology we previously used finite-state methods with a two-level description. Recently we have developed an alternative method that combines different rule types. Pattern-matching rules perform the initial analysis (sufficient, for example, for a spell checker), and post-processing rules then produce the accurate analysis with due ambiguity.
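To give a rough idea of what a pattern-matching rule for initial analysis might look like, here is a minimal sketch in Python. It is not taken from SALAMA: the function name analyze_verb, the regular expression, and the very small sets of subject prefixes and tense markers are all assumptions made for illustration, and the pattern covers only simple affirmative verb forms ending in -a.

```python
import re

# Hypothetical toy pattern, NOT one of SALAMA's rules:
# subject prefix + tense marker + root + final vowel "a".
VERB_PATTERN = re.compile(
    r"(?P<subject>ni|u|a|tu|m|wa)"   # a few subject prefixes
    r"(?P<tense>na|li|ta|me)"        # present, past, future, perfect
    r"(?P<root>[a-z]+?)"             # verb root (shortest match first)
    r"(?P<final>a)"                  # final vowel
)


def analyze_verb(word):
    """Return the matched segments as a dict, or None if the pattern fails."""
    match = VERB_PATTERN.fullmatch(word.lower())
    return match.groupdict() if match else None


print(analyze_verb("ninasoma"))  # segments ni + na + som + a
print(analyze_verb("kitabu"))    # None: not recognised by this toy pattern
```

A rule of this kind only decides whether a string has a plausible shape, which is why it could serve a spell checker; resolving the remaining ambiguity is left to later rules.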
In disambiguation and syntactic mapping we have used a Constraint Grammar parser (CG2), licensed from Connexor. This parser is also used in isolating and handling multi-word expressions.
SALAMA also includes a number of rules for constituent re-ordering, and for transferring the grammatical and lexical information from Swahili into correct English.

Google Interface in African Languages


Google currently offers its interface in the following African languages:

Language    Internet address
Afrikaans   http://www.google.com/intl/af/
Amharic     http://www.google.com/intl/am/
Lingála     http://www.google.com/intl/ln/
Sesotho     http://www.google.com/intl/st/
Shona       http://www.google.com/intl/sn/
Somali      http://www.google.com/intl/so/
Swahili     http://www.google.com/intl/sw/
Tigrinya    http://www.google.com/intl/ti/
Twi         http://www.google.com/intl/tw/
Xhosa       http://www.google.com/intl/xh/
Yoruba      http://www.google.com/intl/yo/
Zulu        http://www.google.com/intl/zu/


The School of Computing & Informatics of the University of Nairobi is hosting COSCIT 2007, the first International Computer Science and ICT Conference.

A separate NLP session / workshop is planned.

More information:

English - Luganda Parallel Corpus

A parallel corpus consists of the same text in two or more different languages. Word-alignment involves finding the links between the words in the two texts. A large word-aligned corpus can be used as source material for statistical machine translation techniques and knowledge transfer techniques.

On this page, you can download a small word-aligned Luganda - English parallel corpus. It consists of 150 manually annotated sentences from the Gospel of Luke (1:1 to 3:18). The English text is the King James Bible and the Luganda text was taken from the on-line Luganda bible.

Needless to say, this is a very modest-size corpus and cannot be used as the only dataset to bootstrap MT research. Its purpose, however, is to provide a gold-standard test set to evaluate and tune automatic word-alignment techniques for larger English-Luganda parallel corpora.
The files were made using the UMIACS Word Alignment Interface. To visualize the parallel corpus, you will need to download this software. Further data-processing can be done immediately on the output files:

  • Luke.tok: English text
  • Lukka.tok: Luganda text
  • aligned.1 ... aligned.150: a description of the word-alignment for each of the 150 sentences.
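As an illustration of how a gold-standard set like this is typically used, the sketch below scores a set of automatically predicted alignment links against manually annotated ones using precision, recall and F1. It assumes each sentence's alignment has already been loaded as a set of (English word index, Luganda word index) pairs; the layout of the aligned.N files is not described here, so no file parsing is attempted, and the function name and example links are made up for illustration.

```python
def alignment_scores(gold_links, predicted_links):
    """Compare predicted word-alignment links with gold-standard links.

    Both arguments are iterables of (english_index, luganda_index) pairs.
    This pair representation is an assumption, not the aligned.N file format.
    """
    gold = set(gold_links)
    predicted = set(predicted_links)
    correct = len(gold & predicted)

    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up links for one sentence pair, for illustration only.
gold = {(0, 0), (1, 2), (2, 1), (3, 3)}
predicted = {(0, 0), (1, 2), (2, 2)}

precision, recall, f1 = alignment_scores(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Averaging or pooling these counts over all 150 sentences would give a single figure for tuning an aligner; other measures, such as alignment error rate, can be computed from the same sets.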

The annotation work was done by Edina Nalukenge in the context of the OCAPI project (University of Antwerp).

Unsupervised Induction of Dholuo Word Classes using Maximum Entropy Learning

Unsupervised Induction of Dholuo Word Classes using Maximum Entropy Learning, De Pauw, Guy, Wagacha, Peter W., and Abade, Dorothy A., Proceedings of the First International Computer Science and ICT Conference (COSCIT 2007), Nairobi, Kenya (2007)

English - Amharic Glossary


On-line English - Amharic glossary

Somali - English Dictionary


On-line Somali - English Dictionary
