
English - Luganda Parallel Corpus

A parallel corpus consists of the same text in two or more languages. Word alignment involves finding the links between corresponding words in the two texts. A large word-aligned corpus can be used as source material for statistical machine translation and knowledge transfer techniques.
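As a rough illustration of what "links between the words" can look like in practice, the sketch below stores an alignment as a set of (source index, target index) pairs. The token lists, the helper names `targets_for` and `unaligned`, and the representation itself are assumptions made for this example; the target tokens are placeholders, not real Luganda, and this is not necessarily how the corpus files encode their links.

```python
# Hypothetical toy sentence pair. The target tokens are placeholders,
# NOT real Luganda words.
source_tokens = ["in", "the", "beginning"]
target_tokens = ["T0", "T1"]

# Each link (i, j) says: source token i is aligned with target token j.
# Two source words may share one target word (many-to-one alignment).
links = {(0, 0), (1, 0), (2, 1)}


def targets_for(links, src_index):
    """Return the sorted target positions linked to one source position."""
    return sorted(j for i, j in links if i == src_index)


def unaligned(links, n_tokens, side):
    """Return positions with no link; side 0 = source, side 1 = target."""
    used = {pair[side] for pair in links}
    return [k for k in range(n_tokens) if k not in used]


for i, word in enumerate(source_tokens):
    linked = [target_tokens[j] for j in targets_for(links, i)]
    print(word, "->", linked)
```

A set of index pairs handles one-to-many and many-to-one links without special cases, which is why it is a common choice for this kind of data.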

On this page, you can download a small Luganda-English word-aligned parallel corpus. It consists of 150 manually annotated sentences from the Gospel of Luke (1:1 through 3:18). The English text is from the King James Bible, and the Luganda text was taken from the on-line Luganda Bible.

Needless to say, this is a very modest corpus, and it cannot serve as the only dataset for bootstrapping MT research. Its purpose, however, is to provide a gold-standard test set for evaluating and tuning automatic word-alignment techniques on larger English-Luganda parallel corpora.

The files were made using the UMIACS Word Alignment Interface. To visualize the parallel corpus, you will need to download this software. Further data processing can be done directly on the output files:

  • Luke.tok: English text
  • Lukka.tok: Luganda text
  • aligned.1 ... aligned.150: a description of the word-alignment for each of the 150 sentences.
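To show how a gold-standard set like this might be used for evaluation, here is a minimal sketch that scores an automatic alignment against manual links using precision, recall and alignment error rate (AER). It assumes every gold link counts as a "sure" link, in which case AER reduces to 1 minus the F1 score. The function name `alignment_scores` and the example link sets are invented for illustration; parsing the aligned.N files is left out because their format is not described on this page.

```python
def alignment_scores(predicted, gold):
    """Return (precision, recall, aer) for two sets of (src, tgt) links.

    Assumption: all gold links are treated as 'sure' links, so
    AER = 1 - 2 * |predicted & gold| / (|predicted| + |gold|).
    """
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    total = len(predicted) + len(gold)
    aer = 1.0 - (2.0 * hits / total) if total else 0.0
    return precision, recall, aer


# Hypothetical links for one sentence: (source index, target index).
gold = {(0, 0), (1, 1), (2, 1), (3, 2)}
predicted = {(0, 0), (1, 1), (3, 3)}

precision, recall, aer = alignment_scores(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} AER={aer:.3f}")
```

For a corpus-level score, the link counts would normally be summed over all 150 sentences before computing the ratios, rather than averaging per-sentence scores.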

The annotation work was done by Edina Nalukenge in the context of the OCAPI project (University of Antwerp).
