parallel corpus
Exploring the sawa corpus: collection and deployment of a parallel corpus English—Swahili
Submitted by Guy on Tue, 2011-09-20 10:32
English - Luganda Parallel Corpus
Submitted by Guy on Thu, 2007-01-18 13:31
A parallel corpus consists of the same text in two or more different languages. Word-alignment involves finding the links between the words in the two texts. A large word-aligned corpus can be used as source material for statistical machine translation techniques and knowledge transfer techniques.
On this page, you can download a small word-aligned Luganda-English parallel corpus. It consists of 150 manually annotated sentences from the Gospel of Luke (1:1 through 3:18). The English text is the King James Bible and the Luganda text was taken from the online Luganda Bible.
Needless to say, this is a very modest-size corpus and cannot be used as the only dataset to bootstrap MT research. Its purpose, however, is to provide a gold-standard test set to evaluate and tune automatic word-alignment techniques for larger English-Luganda parallel corpora.
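As an illustration of how a gold-standard set like this is typically used, the following is a minimal sketch that scores an automatic alignment against the manual one for a single sentence pair. It assumes that alignments are represented as sets of (English index, Luganda index) pairs and that every manual link counts as a "sure" link; the function name and the example links are hypothetical and not taken from the corpus.

```python
from typing import Dict, Set, Tuple

# A link pairs an English token index with a Luganda token index (assumed representation).
Link = Tuple[int, int]


def alignment_scores(predicted: Set[Link], gold: Set[Link]) -> Dict[str, float]:
    """Precision, recall, F1 and alignment error rate (AER) for one sentence pair.

    All gold links are treated as sure links, so AER reduces to
    1 - 2 * |predicted & gold| / (|predicted| + |gold|).
    """
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    total = len(predicted) + len(gold)
    aer = 1.0 - 2 * correct / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "aer": aer}


# Made-up links for illustration only.
gold = {(0, 0), (1, 2), (2, 1), (3, 3)}
predicted = {(0, 0), (1, 2), (2, 3)}
scores = alignment_scores(predicted, gold)
print(scores)
```

Averaging these scores over all 150 sentences, or pooling the link counts before computing them, would give a corpus-level figure for comparing alignment settings.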
The files were made using the UMIACS Word Alignment Interface. To visualize the parallel corpus, you will need to download this software. Further data processing can be done directly on the output files:
- Luke.tok: English text
- Lukka.tok: Luganda text
- aligned.1 ... aligned.150: a description of the word-alignment for each of the 150 sentences.
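A possible starting point for processing the two text files is sketched below. It assumes that Luke.tok and Lukka.tok each hold one whitespace-tokenized sentence per line, in the same order, and that they are UTF-8 encoded; none of this is stated on this page, so check the files first. The function name is hypothetical, and the aligned.N files are left out because their format is not described here.

```python
from pathlib import Path
from typing import List, Tuple

SentencePair = Tuple[List[str], List[str]]


def load_sentence_pairs(english_path: str, luganda_path: str) -> List[SentencePair]:
    """Pair up tokenized sentences from the two files, line by line.

    Assumes one sentence per line and whitespace-separated tokens.
    """
    english = Path(english_path).read_text(encoding="utf-8").splitlines()
    luganda = Path(luganda_path).read_text(encoding="utf-8").splitlines()
    if len(english) != len(luganda):
        raise ValueError(
            f"Line counts differ: {len(english)} English vs {len(luganda)} Luganda"
        )
    return [(e.split(), l.split()) for e, l in zip(english, luganda)]
```

Calling load_sentence_pairs("Luke.tok", "Lukka.tok") would then be expected to return 150 pairs if the assumptions above hold.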
The annotation work was done by Edina Nalukenge in the context of the OCAPI project (University of Antwerp).
